evals: resync analysis schema with operator (drop estimatedImpact, tighten verification)#80
evals: resync analysis schema with operator (drop estimatedImpact, tighten verification)#80harche wants to merge 1 commit into
Conversation
ANALYSIS_WITH_COMPONENTS_SCHEMA mirrors the operator's AnalysisOutputSchema. Sync it with the operator change in openshift/lightspeed-agentic-operator#174: - drop estimatedImpact (removed from the operator API and CRD) - require name+type on verification.steps[] items and description on the verification object, matching the operator schema / AnalysisResult CRD Verified by the live claude/Vertex eval (test_find_token_skill) and offline jsonschema checks. Pre-existing eval-specific divergences (token sentinels, lowercase enums, boolean reversible) are intentional and left unchanged.
|
Warning Review limit reached
More reviews will be available in 41 minutes and 49 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note: this eval fixture is a best-effort approximation of the operator's |
|
@harche: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
joshuawilson
left a comment
There was a problem hiding this comment.
Adversarial Review
The 3 code changes are mechanically correct and exactly mirror operator PR #174. No bugs. But the framing overclaims alignment, and test coverage is single-provider only.
Concern 1: "Resync" title is misleading
The PR title says "resync analysis schema with operator" but significant pre-existing divergences remain untouched — several more impactful than estimatedImpact:
| Path | Sandbox (eval) | Operator | Severity |
|---|---|---|---|
proposal.reversible |
{"type": "boolean"} |
{"type": "string", "enum": ["Reversible", "Irreversible", "Partial"]} |
Type mismatch |
proposal.risk enum |
["low", "medium", "high", "critical"] |
["Low", "Medium", "High", "Critical"] |
Casing mismatch |
diagnosis.confidence enum |
["low", "medium", "high"] |
["Low", "Medium", "High"] |
Casing mismatch |
rollbackPlan location |
Under verification.properties |
Under proposal.properties |
Structural mismatch |
diagnosis.required |
Includes "token" |
Does not | Extra required field |
options[].required |
Includes "rbac", "components" |
Does not | Extra required fields |
The reversible type mismatch is the most glaring: the eval asks models for a boolean while the operator expects one of three enum strings (Reversible/Irreversible/Partial). A model producing true/false for the eval would produce the wrong type at runtime.
Your comment acknowledges this is "a best-effort approximation" — which is fair, but then the PR title shouldn't say "resync." Consider: "evals: drop estimatedImpact, tighten verification required fields (match operator #174)".
Concern 2: Tightening verified with one provider only
The required additions are schema tightenings — previously valid responses become invalid. Verification was only against Claude (make eval EVAL_ARGS="-k claude"). If Gemini or OpenAI produce verification steps without name/type, those evals will now fail. Not blocking (evals are per-provider), but worth noting.
Concern 3: Operator PR #174 not merged
Both PRs are state: open. If #174 is revised (e.g., verification required changes), this PR would be pre-emptively tightened beyond what the operator enforces. Low risk since eval-only, but coordination matters.
Bottom line
Code changes are safe to merge. The diff correctly mirrors #174's three changes. Consider adjusting the title to not overclaim "resync" given the intentional divergences that remain.
joshuawilson
left a comment
There was a problem hiding this comment.
Inline comments on the three changed lines — see above for the full adversarial review.
| "enum": ["low", "medium", "high", "critical"], | ||
| }, | ||
| "reversible": {"type": "boolean"}, | ||
| "estimatedImpact": {"type": "string"}, |
There was a problem hiding this comment.
Removal is clean — estimatedImpact was never in proposal.required, so this is a pure relaxation. Models that still produce the field won't break since there's no additionalProperties: false.
Worth noting: two lines below this removal, reversible is {"type": "boolean"} while the operator defines it as {"type": "string", "enum": ["Reversible", "Irreversible", "Partial"]}. That's a bigger schema/operator divergence than estimatedImpact ever was — a model trained on this eval schema would produce the wrong type at runtime.
| "expected": {"type": "string"}, | ||
| "type": {"type": "string"}, | ||
| }, | ||
| "required": ["name", "type"], |
There was a problem hiding this comment.
This tightening was only verified with Claude (-k claude). Since output_schema is passed to each provider and post-hoc jsonschema.validate runs on the result, if Gemini or OpenAI omit name or type in step items, their evals will start failing after this merge. Might be worth a quick make eval across all providers, or at minimum noting this is a known risk.
| }, | ||
| }, | ||
| }, | ||
| "required": ["description"], |
There was a problem hiding this comment.
Operator PR #174 is still open. If it gets revised (e.g., this required constraint is dropped or changed), this sandbox PR would be pre-emptively tightened beyond what the operator enforces. Low risk since eval-only, but worth tracking — consider adding a comment like # Aligned with operator PR #174 so future readers know the provenance.
Resyncs
evals/schemas.py(ANALYSIS_WITH_COMPONENTS_SCHEMA, which mirrors the operator'sAnalysisOutputSchema) with the operator change in openshift/lightspeed-agentic-operator#174.Changes
estimatedImpactfromproposal— the operator removed it from the API type and CRD.required: ["name", "type"]toverification.steps[]items andrequired: ["description"]to theverificationobject, matching the operator schema and theAnalysisResultCRD.Why
The eval schema is sent as the
outputSchemato/runand used for the post-hocjsonschema.validate. Keeping it in lockstep with the operator ensures eval validation matches what the operator actually enforces, and avoids drift like theestimatedImpactschema/CRD mismatch that motivated #174.Verification
make eval EVAL_ARGS="-k claude"against Vertex / claude-opus-4-6 —test_find_token_skill[claude] PASSED. The model's native structured output conformed to the tightened schema (provider honors the nestedrequired) andjsonschema.validatepassed.description/name/type→ invalid;stepsstays optional).🤖 Generated with Claude Code