medium: 403 authentication errors should exit immediately instead of retrying #182

@flexus-teams

Error Summary

403 authentication errors are not treated as fatal in ckit_service_exec.py. When authentication fails, the bot keeps retrying every 60 seconds and exits only after 3 exceptions accumulate within 5 minutes, wasting resources and producing CrashLoopBackOff states.

Root Cause

Repository: smallcloudai/flexus-client-kit
File: flexus_client_kit/ckit_service_exec.py
Lines: 54-55
Function: run_typical_single_subscription_with_restart_on_network_errors

Why: When a 403 authentication error occurs, the code logs it but does not exit immediately (unlike 460 errors which call sys.exit(1)). Instead, the error is accumulated in the exception_times list and retried every 60 seconds until 3 exceptions occur within 5 minutes.

Authentication errors should be fatal and exit immediately since retrying won't fix invalid credentials.
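
The retry accounting described above can be sketched as follows. This is a minimal illustration, not the code in ckit_service_exec.py: the helper name `record_and_check` and the constants `WINDOW_SECONDS` and `MAX_EXCEPTIONS` are hypothetical, and the threshold values are assumed from the "3 exceptions within 5 minutes" wording of this report.

```python
import time
from typing import List, Optional

# Assumed values, taken from the wording of this report rather than the source file.
WINDOW_SECONDS = 5 * 60
MAX_EXCEPTIONS = 3


def record_and_check(exception_times: List[float], now: Optional[float] = None) -> bool:
    """Record one failure and return True once the loop should give up.

    Hypothetical helper: keeps only failures inside the sliding window,
    then compares the remaining count against the threshold.
    """
    if now is None:
        now = time.time()
    exception_times.append(now)
    # Forget failures that fell out of the 5-minute window.
    exception_times[:] = [t for t in exception_times if now - t <= WINDOW_SECONDS]
    return len(exception_times) >= MAX_EXCEPTIONS
```

Under these assumptions, a bot with a bad key that fails on every attempt and sleeps 60 seconds between attempts only gives up on the third failure, which is the delay this issue asks to remove for 403 errors.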

Code Snippet

err_str = str(e)
if "460:" in err_str:
    logger.error("%s", e)
    sys.exit(1)
elif "403:" in err_str:
    logger.error("Authentication failed - key doesn't work: %s", e)
    # BUG: Should sys.exit(1) here
else:
    logger.info("got %s (attempt %d/3), sleep 60...", type(e).__name__, len(exception_times))
await ckit_shutdown.wait(60)

Git Blame

  • Commit: 4983917
  • Author: Kirill Starkov
  • Date: 2025-12-17
  • Message: "handle custom status codes"

Analysis

The root cause is a logic bug introduced in commit 4983917 where 403 authentication errors were given special logging but not made fatal like 460 errors. The authentication error originates from flexus_backend/flexus_v1/utils_superuser_password.py where it validates bot credentials.

When credentials are invalid, the bot should exit immediately rather than retry 3 times over 5 minutes, accumulating exceptions until it crashes. The current behavior wastes resources and creates confusing CrashLoopBackOff states.

Related Errors

This error cascades into multiple related failures:

  • 403 authentication failure in bot_confirm_exists (underlying cause)
  • Container bot: CrashLoopBackOff (result of bot exiting after 3 failures)
  • Timeout waiting for pod to be ready (pod operator can't communicate with crashed pod)
  • Task failures due to crashed bot

Recommended Fix

In ckit_service_exec.py line 55, after logging the 403 error, add sys.exit(1) to immediately exit on authentication failures, matching the behavior of 460 errors:

elif "403:" in err_str:
    logger.error("Authentication failed - key doesn't work: %s", e)
    sys.exit(1)  # Add this line
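
For reviewers who want to exercise the proposed behavior in isolation, here is a self-contained sketch. The names `FATAL_STATUS_MARKERS`, `is_fatal`, and `handle_exception` are hypothetical and do not exist in flexus-client-kit; the substring checks and log messages mirror the snippet quoted in this issue.

```python
import logging
import sys

logger = logging.getLogger("ckit_service_exec_sketch")

# Hypothetical grouping: status markers for which retrying cannot help.
FATAL_STATUS_MARKERS = ("460:", "403:")


def is_fatal(err_str: str) -> bool:
    """Return True if the error text contains a fatal status marker."""
    return any(marker in err_str for marker in FATAL_STATUS_MARKERS)


def handle_exception(e: Exception, attempt: int) -> None:
    """Exit on fatal errors; otherwise log and return so the caller can retry."""
    err_str = str(e)
    if is_fatal(err_str):
        if "403:" in err_str:
            logger.error("Authentication failed - key doesn't work: %s", e)
        else:
            logger.error("%s", e)
        sys.exit(1)  # proposed behavior: no retry for 403, same as 460
    logger.info("got %s (attempt %d/3), sleep 60...", type(e).__name__, attempt)
```

A regression test could assert that a 403 error raises `SystemExit` with code 1 on the first occurrence, while a plain network error returns normally and leaves the retry loop in charge.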

Impact

Bots with invalid credentials waste resources retrying authentication 3 times before exiting. This creates confusing error patterns and delays detection of credential issues.

Occurrence Data

  • Error ID: 697914e04baba41f22973d2c
  • Investigated at: 2026-01-27T19:52:00Z
  • Occurrence count: 1
  • Affected pods: flexus-pod-bot-karen-20021-rx
  • Affected namespaces: isolated
  • Build info: flexus_commit=1c780beb, ckit_commit=ec1cf8fc

Related Files

  • flexus_client_kit/ckit_bot_exec.py

This issue was automatically created by Diplodocus Detective based on log analysis and investigation.

Labels: bug
