Skip to content

Fix: return 404 when attempting invalid listing in dev environment#4862

Open
mvacula02 wants to merge 9 commits intomainfrom
mvacula/aro-25089
Open

Fix: return 404 when attempting invalid listing in dev environment#4862
mvacula02 wants to merge 9 commits intomainfrom
mvacula/aro-25089

Conversation

@mvacula02
Copy link
Copy Markdown
Collaborator

@mvacula02 mvacula02 commented Apr 13, 2026

fixes ARO-25089 and enables ARO-22570, ARO-22571

What

Implement a fix to error handling after invalid listing. Currently, an internal server occurs instead of the expected ResourceNotFound or similar in the dev environment. Investigation shows that frontend's writeError couldn't extract a resource ID from list endpoint URLs. PR currently only enables the test cases to observe the CI behaviour first. Afterwards a bug fix will be implemented.

The fix adds a ResourceGroupChecker that queries Azure for resource group existence. In the dev environment, there is no ARM proxy layer in front of the RP to validate resource groups before requests reach the frontend. When a list handler encounters a "not found" condition (empty results for clusters, or Cosmos 404 for the parent cluster in node pool/external auth cases), it now checks whether the resource group actually exists. If not, it returns ResourceGroupNotFoundError (404). If the resource group exists but the parent cluster doesn't (node pool and external auth listing), it returns ParentResourceNotFoundError (404). These are proper *CloudError values that the error handler (internal/errorutils/error.go) handles directly, avoiding the ResourceIDFromContext fallback that fails on list endpoints (where no single resource ID exists in the context) and causes an internal server error. The behaviour should now be consistent with higher environments.

Front end changes will not be implemented, instead The PR adds an armResourceGroupValidationPolicy to the E2E test harness that simulates ARM's resource group validation in development environment. For every request targeting the RP frontend, the policy checks whether the resource group exists in Azure and returns a 404 if it doesn't, the same behavior ARM provides in production. This enables the ARO-22570 and ARO-22571 test cases for listing clusters and node pools in nonexistent resource groups, without any changes to the frontend.

Scenario removed: listing a resource with non-existent parent. Removed as it can't be implemented without significant front end changes.

Note: tests will be enabled in #4876 as the frontend changes need to populate first #4876 will be closed as the tests can be enabled within the PR as the fix now

Why

Implement missing error handling and unblock negative listing cases both for clusters and node pools.

Testing

Tests were initially blocked by the bug. Enabling the tests again and having the run succeed should warrant the bug fix to work.

Bug fix was verified through successful listing test runs in personal dev environment with a locally built frontend.

enable negative listing test cases initially to observe behaviour in CI
https://redhat.atlassian.net/browse/ARO-25089
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 13, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test e2e-parallel

@mvacula02
Copy link
Copy Markdown
Collaborator Author

mvacula02 commented Apr 13, 2026

  • Remove MustFilter after test validation in Stage

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test stage-e2e-parallel

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test e2e-parallel

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test e2e-parallel

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test stage-e2e-parallel

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test e2e-parallel

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test stage-e2e-parallel

@mvacula02 mvacula02 changed the title Fix: return 404 when attempting invalid listing Fix: return 404 when attempting invalid listing in dev environment Apr 14, 2026
@mvacula02 mvacula02 marked this pull request as ready for review April 14, 2026 16:13
Copilot AI review requested due to automatic review settings April 14, 2026 16:13
@openshift-ci openshift-ci bot requested review from deads2k and geoberle April 14, 2026 16:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes dev-environment listing endpoints to return appropriate 404 ARM *CloudError responses (instead of 500s) when the resource group or parent resource does not exist.

Changes:

  • Add Azure ARM-backed ResourceGroupChecker and wire it into Frontend.
  • Introduce new ARM error helpers for resource-group and parent-resource “not found” cases.
  • Update cluster / node pool / external auth list handlers to return 404s with correct CloudError payloads.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
test-integration/utils/integrationutils/utils.go Updates frontend construction to pass the new resourceGroupChecker parameter.
test-integration/go.mod Adds ARM resources SDK dependency used for RG existence checks.
internal/api/arm/error.go Adds new CloudError code + constructors for RG-not-found and parent-not-found.
frontend/pkg/frontend/testhelpers.go Updates test frontend constructor for new parameter.
frontend/pkg/frontend/resource_group.go Adds ResourceGroupChecker + Azure implementation using ARM ResourceGroups client.
frontend/pkg/frontend/node_pool.go Returns 404 CloudErrors for RG-not-found / parent-cluster-not-found during node pool list.
frontend/pkg/frontend/frontend_test.go Updates test frontend instantiations for new parameter.
frontend/pkg/frontend/frontend.go Extends Frontend to hold a ResourceGroupChecker and plumbs it through constructor.
frontend/pkg/frontend/external_auth.go Returns 404 CloudErrors for RG-not-found / parent-cluster-not-found during external auth list.
frontend/pkg/frontend/cluster.go Checks RG existence when cluster list is empty to distinguish RG-not-found from empty list.
frontend/cmd/cmd.go Instantiates DefaultAzureCredential + RG checker and injects it into the frontend.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread frontend/cmd/cmd.go Outdated
Comment thread frontend/pkg/frontend/resource_group.go Outdated
Comment thread frontend/pkg/frontend/resource_group.go Outdated
Comment thread internal/api/arm/error.go Outdated
Comment thread frontend/pkg/frontend/node_pool.go Outdated
Comment thread frontend/pkg/frontend/cluster.go Outdated
Copy link
Copy Markdown
Collaborator

@mbarnes mbarnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your pull request description you say:

In the dev environment, there is no ARM proxy layer in front of the RP to validate resource groups before requests reach the frontend.

I understand that to mean azureResourceGroupChecker is only applicable to development environments, not to environments where ARM is present because a collection GET request for a non-existent resource group would never even get routed to the RP.

I think this is important to capture in the source code so it doesn't get removed down the road by someone thinking "this can never happen in production, why do we need it?"

Could you add a comment somewhere explaining this? Wherever you think appropriate.

Copilot AI review requested due to automatic review settings April 16, 2026 10:07
@mvacula02 mvacula02 requested a review from mbarnes April 16, 2026 10:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread frontend/cmd/cmd.go Outdated
Comment thread frontend/pkg/frontend/resource_group.go Outdated
Comment thread frontend/pkg/frontend/resource_group.go Outdated
@geoberle
Copy link
Copy Markdown
Collaborator

i don't feel comfortable with this change. we are moving something that ARM is doing into the RP and it is painful for the things where we do it.
have you considered replicating ARMs behavior in our "AroRpApiCompatible" test harness? (armSystemDataPolicy)

@bennerv @mbarnes wdyt?

@mvacula02
Copy link
Copy Markdown
Collaborator Author

have you considered replicating ARMs behavior in our "AroRpApiCompatible" test harness? (armSystemDataPolicy)

I believe that would be possible only for the non-existent listings. The case, where a parent resource is non-existent would need to be dropped. That is fine with me if you think the current approach is too much

@mbarnes
Copy link
Copy Markdown
Collaborator

mbarnes commented Apr 16, 2026

have you considered replicating ARMs behavior in our "AroRpApiCompatible" test harness? (armSystemDataPolicy)

If the negative test cases you've added to #4876 are deemed worth keeping, maybe just split them out in such a way that they don't have the AroRpApiCompatible label since they're not actually compatible without ARM present.

Those cases would then be testing more ARM behavior than RP behavior, so maybe it's appropriate to drop them entirely.

Copilot AI review requested due to automatic review settings April 16, 2026 12:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/util/framework/per_invocation_framework.go
Comment thread test/util/framework/azure_client_policies.go
Comment thread test/e2e/simple_negative_cases.go
@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test e2e-parallel

// parseResourceGroupFromPath extracts the subscription ID and resource group
// name from a URL path like /subscriptions/{subId}/resourceGroups/{rgName}/...
func parseResourceGroupFromPath(path string) (subscriptionID, resourceGroupName string) {
parts := strings.Split(strings.ToLower(path), "/")
Copy link
Copy Markdown
Collaborator

@mgahagan73 mgahagan73 Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using strings.EqualFold() here. You will be able to drop the ToLower() step as it is case insensitive. For example:

parts := strings.Split(path, "/")
for i, part := range parts {
if strings.EqualFold(part, "subscriptions") && i+1 < len(parts) {
subscriptionID = parts[i+1]
}
if strings.EqualFold(part, "resourcegroups") && i+1 < len(parts) {
resourceGroupName = parts[i+1]
}
}

_, err = client.Get(req.Raw().Context(), rgName, nil)
if err != nil {
var respErr *azcore.ResponseError
if errors.As(err, &respErr) && respErr.ErrorCode == "ResourceGroupNotFound" {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace the literal "ResourceGroupNotFound" with arm.CloudErrorCodeResourceGroupNotFound to be consistent with the rest of the code.

@mgahagan73
Copy link
Copy Markdown
Collaborator

In addition to previous comments I think might want to add a unit test for the ARM simulator bits in parseResourceGroupFromPath() and other functions in azure_client_policies.go . Adding it in separate PR would be fine.

@mvacula02 mvacula02 requested a review from mgahagan73 April 17, 2026 08:00
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 17, 2026

@mvacula02: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/cspr da6349e link true /test cspr

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mvacula02
Copy link
Copy Markdown
Collaborator Author

/test e2e-parallel

@mgahagan73
Copy link
Copy Markdown
Collaborator

/approve

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 17, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgahagan73, mvacula02

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants