Remove kube-rbac-proxy and replace with native metrics auth#73
Remove kube-rbac-proxy and replace with native metrics auth#73shreyabiradar07 wants to merge 11 commits intokruize:mvp_demofrom
Conversation
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
Reviewer's GuideReplaces the external kube-rbac-proxy sidecar with controller-runtime’s native secure metrics server configuration, centralizing TLS/HTTP2 handling, locking metrics to HTTPS on 8443, granting TokenReview/SAR RBAC, updating tests and manifests, and simplifying CI and kustomize defaults accordingly. Sequence diagram for authenticated metrics scrape via controller_runtime filterssequenceDiagram
actor Admin
participant Prometheus
participant KubeAPIServer
participant Manager as KruizeOperatorManager
participant Authn as TokenReviewAPI
participant Authz as SubjectAccessReviewAPI
Admin->>Prometheus: Configure scrape target <svc>:8443/metrics
Prometheus->>KubeAPIServer: HTTPS GET /api/v1/namespaces/ns/services/svc:8443/proxy/metrics (Bearer token)
KubeAPIServer->>Manager: Forward HTTPS /metrics request
Manager->>Authn: Create TokenReview
Authn-->>Manager: TokenReview status (authenticated/unauthenticated)
alt token_invalid_or_unauthenticated
Manager-->>Prometheus: 401 Unauthorized
else token_authenticated
Manager->>Authz: Create SubjectAccessReview
Authz-->>Manager: SAR status (allowed/denied)
alt access_denied
Manager-->>Prometheus: 403 Forbidden
else access_allowed
Manager-->>Prometheus: 200 OK + metrics
end
end
Class diagram for metrics TLS and authentication utilitiesclassDiagram
class Utils {
<<package>>
}
class MetricsOptionsUtil {
<<utility>>
+string DefaultMetricsPort
+string DefaultMetricsAddr
+string LocalMetricsAddr
+GetTLSOpts(enableHTTP2 bool) []func(tls_Config)
+GetMetricsOptions(addr string, secure bool, enableHTTP2 bool) metricsserver_Options
}
class Main {
<<executable>>
-string metricsAddr
-string probeAddr
-bool enableLeaderElection
-bool secureMetrics
-bool enableHTTP2
+main()
}
class MetricsServerOptions {
<<from_controller_runtime>>
+string BindAddress
+bool SecureServing
+FilterProvider FilterProvider
+[]func(tls_Config) TLSOpts
}
class Filters {
<<from_controller_runtime>>
+WithAuthenticationAndAuthorization() FilterProvider
}
Utils <|-- MetricsOptionsUtil
Main ..> MetricsOptionsUtil : uses
MetricsOptionsUtil ..> MetricsServerOptions : constructs
MetricsOptionsUtil ..> Filters : uses
%% Represent tls.Config and metricsserver.Options without methods
class tls_Config {
<<struct>>
+[]string NextProtos
}
class metricsserver_Options {
<<alias_of_MetricsServerOptions>>
}
MetricsServerOptions <|-- metricsserver_Options
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The RBAC for tokenreviews/subjectaccessreviews is added both via kubebuilder markers and by hand in config/rbac/role.yaml; since role.yaml is generated, consider removing the manual edits and regenerating manifests so RBAC stays source-of-truth in the markers.
- The go.mod change pulls in a large number of new indirect dependencies (CEL, gRPC, OTel, etc.); it would be good to verify they are actually required for the metrics auth change and run
go mod tidyto drop anything unnecessary and keep the dependency surface minimal. - In config/rbac/kustomization.yaml and config/default/kustomization.yaml, the comments still talk about enabling/disabling the auth proxy even though kube-rbac-proxy has been removed; updating these comments to describe the new native metrics auth behavior will make the configuration more self-explanatory.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The RBAC for tokenreviews/subjectaccessreviews is added both via kubebuilder markers and by hand in config/rbac/role.yaml; since role.yaml is generated, consider removing the manual edits and regenerating manifests so RBAC stays source-of-truth in the markers.
- The go.mod change pulls in a large number of new indirect dependencies (CEL, gRPC, OTel, etc.); it would be good to verify they are actually required for the metrics auth change and run `go mod tidy` to drop anything unnecessary and keep the dependency surface minimal.
- In config/rbac/kustomization.yaml and config/default/kustomization.yaml, the comments still talk about enabling/disabling the auth proxy even though kube-rbac-proxy has been removed; updating these comments to describe the new native metrics auth behavior will make the configuration more self-explanatory.
## Individual Comments
### Comment 1
<location path="config/default/kustomization.yaml" line_range="31-33" />
<code_context>
# Protect the /metrics endpoint by putting it behind auth.
# If you want your controller-manager to expose the /metrics
# endpoint w/o any authn/z, please comment the following line.
-- path: manager_auth_proxy_patch.yaml
</code_context>
<issue_to_address>
**suggestion:** Update the metrics auth comment now that the auth proxy patch line has been removed
This comment still instructs readers to “comment the following line,” but the `manager_auth_proxy_patch.yaml` entry has been removed rather than commented out. To avoid confusion, please update or remove this block so it matches the current setup, where metrics are protected via controller-runtime filters instead of the auth proxy patch.
```suggestion
# /metrics is protected by controller-runtime authentication/authorization filters.
# If you want your controller-manager to expose the /metrics endpoint without authn/z,
# update the manager configuration (e.g. flags or filters) instead of using an auth proxy patch.
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The envtest suite now hardcodes a separate manager with metrics on 127.0.0.1:8443; consider refactoring to share the same metrics configuration/options construction as cmd/main.go to avoid drift between test and production setups.
- A large number of new indirect dependencies were added to go.mod; it may be worth checking whether they all stem from an intentional library upgrade (e.g., controller-runtime) and pruning any that are no longer required to keep the module graph minimal.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The envtest suite now hardcodes a separate manager with metrics on 127.0.0.1:8443; consider refactoring to share the same metrics configuration/options construction as cmd/main.go to avoid drift between test and production setups.
- A large number of new indirect dependencies were added to go.mod; it may be worth checking whether they all stem from an intentional library upgrade (e.g., controller-runtime) and pruning any that are no longer required to keep the module graph minimal.
## Individual Comments
### Comment 1
<location path="internal/controller/kruize_controller_test.go" line_range="810" />
<code_context>
+ })
+
+ Context("when accessing the metrics endpoint", func() {
+ It("should reject unauthorized requests with 401 or 403", func() {
+ By("calling the metrics endpoint without an Authorization header")
+ resp, err := client.Get("https://127.0.0.1:8443/metrics")
</code_context>
<issue_to_address>
**suggestion (testing):** Avoid potential flakiness by waiting for the metrics server to be ready before asserting on the response
This test calls `https://127.0.0.1:8443/metrics` and asserts immediately, but the manager/metrics server are started asynchronously in `BeforeSuite`, so there’s a race where the server may not be listening yet, causing intermittent connection errors. To avoid flakiness, wrap the request and status-code assertions in an `Eventually` that retries until a successful connection (or timeout).
</issue_to_address>
### Comment 2
<location path="internal/controller/kruize_controller_test.go" line_range="836" />
<code_context>
+ defer resp.Body.Close()
+
+ By("verifying the metrics server intercepted the request")
+ // Accepting 401/403 (Real Cluster) or 500 (EnvTest limitation)
+ // All three prove the filter is protecting the endpoint.
+ Expect(resp.StatusCode).To(SatisfyAny(
</code_context>
<issue_to_address>
**suggestion (testing):** The acceptance of HTTP 500 as a passing condition may make the test less meaningful
Treating `500 Internal Server Error` as success weakens the assertion, since it might reflect a real bug rather than just an envtest limitation. If envtest always returns 500 for invalid tokens, consider detecting that mode explicitly (e.g., build tag or env var) and branching expectations, or at least asserting something envtest-specific in the response (like a known error message) so genuine internal failures aren’t silently accepted.
Suggested implementation:
```golang
"net/http"
"os"
```
```golang
By("verifying the metrics server intercepted the request")
if os.Getenv("ENVTEST_MODE") == "true" {
// In envtest mode we explicitly expect the known 500 behavior,
// rather than treating any 500 as success.
Expect(resp.StatusCode).To(
Equal(http.StatusInternalServerError),
"Expected 500 from envtest when handling unauthenticated request",
)
} else {
Expect(resp.StatusCode).To(SatisfyAny(
Equal(http.StatusUnauthorized),
Equal(http.StatusForbidden),
), "Expected 401 or 403 for unauthenticated request")
}
```
To fully implement the intent of your review comment, you should make a similar change in the `"should fail when an invalid token is provided"` test: instead of accepting 500 along with 401/403, branch on the same `ENVTEST_MODE` (or a better-suited name if your project already uses one) and:
1. Expect exactly `http.StatusInternalServerError` in envtest mode.
2. Otherwise, expect 401/403 (or the appropriate non-envtest statuses for invalid tokens).
This ensures 500 is only treated as success when you're explicitly running under the envtest behavior, and avoids silently passing on genuine internal server errors.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
|
@sourcery-ai review |
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The metrics port and bind address are now hard-coded as 8443 in several places (flags in main, manager.yaml args/ports, envtest manager, and metrics tests); consider centralizing this value in one place or using shared constants to avoid config drift in future changes.
- GetMetricsOptions always configures FilterProvider even when secureServing is disabled; if you intend to support an unprotected or plain HTTP metrics endpoint, you may want to make the filters conditional on the secure flag to avoid unexpected authn/z behavior.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The metrics port and bind address are now hard-coded as 8443 in several places (flags in main, manager.yaml args/ports, envtest manager, and metrics tests); consider centralizing this value in one place or using shared constants to avoid config drift in future changes.
- GetMetricsOptions always configures FilterProvider even when secureServing is disabled; if you intend to support an unprotected or plain HTTP metrics endpoint, you may want to make the filters conditional on the secure flag to avoid unexpected authn/z behavior.
## Individual Comments
### Comment 1
<location path="internal/controller/kruize_controller_test.go" line_range="856-864" />
<code_context>
+ }
+ })
+
+ It("should have the metrics server running and reachable", func() {
+ By("checking the connection to the secure port")
+ resp, err := client.Get("https://127.0.0.1:8443/metrics")
+ Expect(err).ToNot(HaveOccurred())
+ defer resp.Body.Close()
+
+ // Reachability test: getting any HTTP response proves the server is listening
+ Expect(resp).ToNot(BeNil())
+ Expect(resp.StatusCode).ToNot(Equal(0))
+ })
+ })
</code_context>
<issue_to_address>
**suggestion (testing):** Use Eventually/Consistently to avoid flakiness when checking metrics server reachability
This test performs a single GET to `https://127.0.0.1:8443/metrics` and assumes the metrics server is already listening. Because the manager starts in `BeforeSuite`, this can intermittently fail with connection errors/EOFs. Consider wrapping the GET in `Eventually` (as in the first test in this context) to retry until you get any non-zero status code, or using `Consistently` after an `Eventually`-based readiness check to keep the reachability intent while avoiding flakiness.
Suggested implementation:
```golang
It("should have the metrics server running and reachable", func() {
checkMetricsReachable := func() error {
resp, err := client.Get("https://127.0.0.1:8443/metrics")
if err != nil {
return err
}
defer resp.Body.Close()
// Reachability test: getting any HTTP response proves the server is listening
if resp == nil {
return fmt.Errorf("expected non-nil response from metrics endpoint")
}
if resp.StatusCode == 0 {
return fmt.Errorf("expected non-zero HTTP status code from metrics endpoint")
}
return nil
}
By("waiting for the metrics server to become reachable on the secure port")
Eventually(checkMetricsReachable).Should(Succeed())
By("verifying the metrics server remains reachable on the secure port")
Consistently(checkMetricsReachable).Should(Succeed())
})
```
1. Ensure `fmt` is imported at the top of `internal/controller/kruize_controller_test.go`:
- If not present, add `fmt` to the existing import block, e.g.:
`import ( ... "fmt" ... )`.
2. If other tests in this file use custom timeouts/intervals for `Eventually`/`Consistently`, consider aligning this test with those by adding explicit duration/interval arguments to both calls (e.g. `Eventually(checkMetricsReachable, 10*time.Second, 250*time.Millisecond)`), and ensure `time` is imported if not already.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The
Metrics Auth Integrationtests currently treat HTTP 500 as an acceptable outcome for unauthorized access in the firstEventuallycheck; this can mask real regressions in the metrics auth path and might be worth tightening to only the expected 401/403 (and explicitly-scoped 500 in envtest where needed). - The new metrics auth wiring pulled in a large set of indirect dependencies (e.g., OpenTelemetry, grpc-gateway, cel-go); it would be good to double-check which of these are actually required by controller-runtime filters and prune any that are not strictly needed to keep the module lean.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `Metrics Auth Integration` tests currently treat HTTP 500 as an acceptable outcome for unauthorized access in the first `Eventually` check; this can mask real regressions in the metrics auth path and might be worth tightening to only the expected 401/403 (and explicitly-scoped 500 in envtest where needed).
- The new metrics auth wiring pulled in a large set of indirect dependencies (e.g., OpenTelemetry, grpc-gateway, cel-go); it would be good to double-check which of these are actually required by controller-runtime filters and prune any that are not strictly needed to keep the module lean.
## Individual Comments
### Comment 1
<location path="internal/controller/kruize_controller_test.go" line_range="811-820" />
<code_context>
+ })
+
+ Context("when accessing the metrics endpoint", func() {
+ It("should reject unauthorized requests", func() {
+ By("calling the metrics endpoint until it is ready")
+
+ Eventually(func() (int, error) {
+ metricsURL := fmt.Sprintf("https://%s/metrics", utils.LocalMetricsAddr)
+ resp, err := client.Get(metricsURL)
+ if err != nil {
+ return 0, err // Return error to trigger a retry
+ }
+ defer resp.Body.Close()
+ return resp.StatusCode, nil
+ }, "10s", "1s").Should(SatisfyAny(
+ Equal(http.StatusUnauthorized),
+ Equal(http.StatusForbidden),
</code_context>
<issue_to_address>
**suggestion (testing):** Avoid accepting 500 from the metrics endpoint in real-cluster mode when asserting unauthorized access
`Eventually` currently treats `StatusInternalServerError` as acceptable for unauthorized requests. That’s reasonable in envtest (authn/z webhooks can surface as 500), but in a real cluster a 500 usually indicates a misconfigured TokenReview/SAR or metrics filters. To avoid this test masking real production issues, branch on `KRUIZE_TEST_MODE` (as in the next test) and only permit 500 in envtest mode, while requiring 401/403 in real-cluster mode.
Suggested implementation:
```golang
Context("when accessing the metrics endpoint", func() {
It("should reject unauthorized requests", func() {
By("calling the metrics endpoint until it is ready")
testMode := os.Getenv("KRUIZE_TEST_MODE")
if testMode == "envtest" {
// In envtest, 500 may surface from authn/z webhooks, so allow it
Eventually(func() (int, error) {
metricsURL := fmt.Sprintf("https://%s/metrics", utils.LocalMetricsAddr)
resp, err := client.Get(metricsURL)
if err != nil {
return 0, err // Return error to trigger a retry
}
defer resp.Body.Close()
return resp.StatusCode, nil
}, "10s", "1s").Should(SatisfyAny(
Equal(http.StatusUnauthorized),
Equal(http.StatusForbidden),
Equal(http.StatusInternalServerError),
), "The metrics server should eventually be reachable and reject the unauthorized request in envtest")
} else {
// In real clusters, 500 usually indicates misconfiguration, so do not accept it
Eventually(func() (int, error) {
metricsURL := fmt.Sprintf("https://%s/metrics", utils.LocalMetricsAddr)
resp, err := client.Get(metricsURL)
if err != nil {
return 0, err // Return error to trigger a retry
}
defer resp.Body.Close()
return resp.StatusCode, nil
}, "10s", "1s").Should(SatisfyAny(
Equal(http.StatusUnauthorized),
Equal(http.StatusForbidden),
), "The metrics server should eventually be reachable and reject the unauthorized request in real-cluster mode")
}
})
It("should fail when an invalid token is provided", func() {
By("creating a request with an invalid bearer token")
metricsURL := fmt.Sprintf("https://%s/metrics", utils.LocalMetricsAddr)
req, err := http.NewRequest("GET", metricsURL, nil)
Expect(err).ToNot(HaveOccurred())
req.Header.Set("Authorization", "Bearer invalid-token")
```
1. Add the `os` package import at the top of `internal/controller/kruize_controller_test.go` if it is not already present:
<<<<<<< SEARCH
import (
"crypto/tls"
"fmt"
"net/http"
=======
import (
"crypto/tls"
"fmt"
"net/http"
"os"
>>>>>>> REPLACE
2. If the existing import block differs, adjust the `SEARCH` section accordingly so that `os` is included in the same grouped import block.
</issue_to_address>
### Comment 2
<location path="internal/controller/kruize_controller_test.go" line_range="829-838" />
<code_context>
+ It("should fail when an invalid token is provided", func() {
</code_context>
<issue_to_address>
**suggestion (testing):** Add readiness waiting to avoid flakes when testing invalid-token behavior
This test does a single `client.Do(req)` without any readiness wait. If the metrics server isn’t up yet, the call can fail with a transient connection error, making the test flaky and obscuring the invalid-token behavior. Please either reuse the readiness logic from the first test before sending the invalid-token request, or wrap `client.Do(req)` in an `Eventually` that tolerates transient connection errors and then asserts the final status code.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
Signed-off-by: Shreya Biradar <shbirada@ibm.com>
|
The image works as expected. |
This PR using native metrics authentication integrates TokenReview and SubjectAccessReview directly into the operator's Go binary via controller-runtime. This secures the
/metricsendpoint with HTTPS and RBAC internally, removing the deprecated external sidecar container like kube-rbac-proxy.Docker image:
quay.io/shbirada/kruize_operator:remove-kube-rbac-proxySummary by Sourcery
Replace the external kube-rbac-proxy sidecar with native controller-runtime metrics authentication and secure serving for the /metrics endpoint.
Bug Fixes:
Enhancements:
CI: