Description
Error Summary
403 authentication errors are not handled as fatal errors in ckit_service_exec.py. When authentication fails, bots retry 3 times over 5 minutes before exiting, wasting resources and creating CrashLoopBackOff states.
Root Cause
Repository: smallcloudai/flexus-client-kit
File: flexus_client_kit/ckit_service_exec.py
Lines: 54-55
Function: run_typical_single_subscription_with_restart_on_network_errors
Why: When a 403 authentication error occurs, the code logs it but does not exit immediately (unlike 460 errors which call sys.exit(1)). Instead, the error is accumulated in the exception_times list and retried every 60 seconds until 3 exceptions occur within 5 minutes.
Authentication errors should be fatal and exit immediately since retrying won't fix invalid credentials.
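The distinction between fatal and retryable errors can be sketched as a small helper. This is only an illustration: `FATAL_STATUS_MARKERS`, `is_fatal_error`, and `exit_code_for` are hypothetical names that do not exist in flexus-client-kit, and the substring check simply mirrors the style of the existing `"460:" in err_str` test.

```python
from typing import Optional

# Hypothetical helper, not part of flexus-client-kit.
# Status markers for which a retry cannot succeed. "460:" is already fatal
# in the current code; "403:" is the addition proposed in this issue.
FATAL_STATUS_MARKERS = ("460:", "403:")


def is_fatal_error(err_str: str) -> bool:
    """Return True if the error message contains a fatal status marker."""
    return any(marker in err_str for marker in FATAL_STATUS_MARKERS)


def exit_code_for(exc: Exception) -> Optional[int]:
    """Return 1 when the process should exit, None when a retry makes sense."""
    return 1 if is_fatal_error(str(exc)) else None
```

With a helper like this, the caller would log the error and call `sys.exit(1)` whenever `exit_code_for(e)` is not `None`, and fall through to the retry path otherwise.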
Code Snippet
    err_str = str(e)
    if "460:" in err_str:
        logger.error("%s", e)
        sys.exit(1)
    elif "403:" in err_str:
        logger.error("Authentication failed - key doesn't work: %s", e)
        # BUG: Should sys.exit(1) here
    else:
        logger.info("got %s (attempt %d/3), sleep 60...", type(e).__name__, len(exception_times))
    await ckit_shutdown.wait(60)
Git Blame
- Commit: 4983917
- Author: Kirill Starkov
- Date: 2025-12-17
- Message: "handle custom status codes"
Analysis
The root cause is a logic bug introduced in commit 4983917 where 403 authentication errors were given special logging but not made fatal like 460 errors. The authentication error originates from flexus_backend/flexus_v1/utils_superuser_password.py where it validates bot credentials.
When credentials are invalid, the bot should exit immediately rather than retry 3 times over 5 minutes, accumulating exceptions until it crashes. The current behavior wastes resources and creates confusing CrashLoopBackOff states.
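To make the retry cost concrete, here is a simplified model of the retry budget described above. It is a sketch under assumptions: the issue does not show how `exception_times` is actually maintained, so `FailureWindow` and `seconds_until_exit` are hypothetical, and the constants are taken from the issue text (3 failures, a 5-minute window, 60-second sleeps).

```python
from collections import deque

WINDOW_SECONDS = 300.0  # 5-minute window, per the issue text
MAX_FAILURES = 3        # matches the "attempt %d/3" log message


class FailureWindow:
    """Hypothetical stand-in for the exception_times bookkeeping."""

    def __init__(self) -> None:
        self.times = deque()

    def record(self, now: float) -> bool:
        """Record a failure at `now`; return True once the budget is spent."""
        self.times.append(now)
        # Forget failures that have fallen out of the sliding window.
        while self.times and now - self.times[0] > WINDOW_SECONDS:
            self.times.popleft()
        return len(self.times) >= MAX_FAILURES


def seconds_until_exit(sleep_seconds: float = 60.0) -> float:
    """Simulate a login that always fails; return the time of the last failure.

    Only call this with sleep intervals short enough to fit MAX_FAILURES
    failures into the window, otherwise the loop never terminates.
    """
    window = FailureWindow()
    now = 0.0
    while not window.record(now):
        now += sleep_seconds
    return now
```

In this model a bot with a bad key fails at 0, 60, and 120 seconds before giving up, so it spends at least two full sleep intervals on retries that cannot succeed. The real figure also depends on how long each failed attempt takes, which the model ignores.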
Related Errors
This error cascades into multiple related failures:
- 403 authentication failure in bot_confirm_exists (underlying cause)
- Container bot: CrashLoopBackOff (result of bot exiting after 3 failures)
- Timeout waiting for pod to be ready (pod operator can't communicate with crashed pod)
- Task failures due to crashed bot
Recommended Fix
In ckit_service_exec.py line 55, after logging the 403 error, add sys.exit(1) to immediately exit on authentication failures, matching the behavior of 460 errors:
    elif "403:" in err_str:
        logger.error("Authentication failed - key doesn't work: %s", e)
        sys.exit(1)  # Add this line
Impact
Bots with invalid credentials waste resources retrying authentication 3 times before exiting. This creates confusing error patterns and delays detection of credential issues.
Occurrence Data
- Error ID: 697914e04baba41f22973d2c
- Investigated at: 2026-01-27T19:52:00Z
- Occurrence count: 1
- Affected pods: flexus-pod-bot-karen-20021-rx
- Affected namespaces: isolated
- Build info: flexus_commit=1c780beb, ckit_commit=ec1cf8fc
Related Files
flexus_client_kit/ckit_bot_exec.py
This issue was automatically created by Diplodocus Detective based on log analysis and investigation.