3 changes: 3 additions & 0 deletions src/aks-sreclaw/.gitignore
@@ -0,0 +1,3 @@
# Ignore Poetry artifacts
poetry.lock
pyproject.toml
17 changes: 17 additions & 0 deletions src/aks-sreclaw/HISTORY.rst
@@ -0,0 +1,17 @@
.. :changelog:

Release History
===============

Guidance
++++++++
If there is no rush to release a new version, add a description of the change under the *Pending* section.

To release a new version, choose a new version number (usually the last patch version plus 1; X.Y.Z means Major.Minor.Patch, see `Semantic Versioning <https://semver.org/>`_ for details). Then add a new section named after that version number to this file, containing the new changes together with everything from the *Pending* section. Finally, update the ``VERSION`` variable in ``setup.py`` to the new version number.

Pending
+++++++

1.0.0b1
+++++++
* Add AKS SREClaw `az aks claw`.
186 changes: 186 additions & 0 deletions src/aks-sreclaw/README.rst
@@ -0,0 +1,186 @@
Azure CLI AKS SREClaw Extension
================================

This extension provides commands to manage AKS SREClaw, an autonomous AI-powered troubleshooting assistant for Azure Kubernetes Service clusters.

Installation
------------

To install the extension:

.. code-block:: bash

    az extension add --name aks-sreclaw

Usage
-----

Deploy SREClaw to your AKS cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initialize and deploy SREClaw with interactive LLM configuration:

.. code-block:: bash

    az aks claw create --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

This command will:

1. Prompt you to select an LLM provider (Azure OpenAI or OpenAI)
2. Guide you through entering model names and API credentials
3. Validate the connection to your LLM provider
4. Prompt for a Kubernetes service account name
5. Deploy the SREClaw helm chart to your cluster
6. Wait for pods to be ready (up to 5 minutes)
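The final step, waiting up to 5 minutes for readiness, is a bounded polling loop. The following is a minimal sketch of that pattern, not the extension's actual implementation: ``wait_until_ready``, its ``check`` callback, and the 5-second poll interval are hypothetical names and values chosen for illustration. Only the 300-second limit comes from the text above.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 300.0,   # 5 minutes, as stated in the README
    interval: float = 5.0,    # assumed poll interval (not from the source)
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` elapses.

    Returns True if the check succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a real deployment the ``check`` callback would query pod readiness; here it is left abstract so the timeout logic can be read and tested on its own.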

Deploy without waiting for completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    az aks claw create --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system --no-wait

Check deployment status
~~~~~~~~~~~~~~~~~~~~~~~

View the current status of your SREClaw deployment:

.. code-block:: bash

    az aks claw status --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

This displays:

- Helm release status
- Deployment replica counts
- Pod status and readiness
- Configured LLM providers with models and API endpoints

Connect to SREClaw service
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Establish a port-forward connection to access the SREClaw web interface:

.. code-block:: bash

    az aks claw connect --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

The command will:

- Display the gateway authentication token
- Create a port-forward to localhost:18789
- Provide instructions to open the service in your browser

To use a different local port:

.. code-block:: bash

    az aks claw connect --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system --local-port 8080

Press Ctrl+C to stop the port-forwarding.
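While the port-forward is running, you can confirm from another terminal that the local port accepts connections. The snippet below is a small sketch using only the Python standard library. ``is_port_open`` is a hypothetical helper, not part of the extension, and the default port 18789 is taken from the description above.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 18789, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("port-forward reachable:", is_port_open())
```

Pass ``port=8080`` (or whichever value you gave to ``--local-port``) when not using the default.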

Delete SREClaw
~~~~~~~~~~~~~~

Uninstall SREClaw and clean up all resources:

.. code-block:: bash

    az aks claw delete --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

This command will:

1. Prompt for confirmation
2. Uninstall the SREClaw helm chart
3. Delete all associated secrets and configurations
4. Wait for pods to be removed

To delete without waiting:

.. code-block:: bash

    az aks claw delete --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system --no-wait

LLM Provider Configuration
---------------------------

Azure OpenAI
~~~~~~~~~~~~

When prompted during deployment, select Azure OpenAI and provide:

- **Models**: Comma-separated model names (e.g., ``gpt-5.4,gpt-5.1``)
- **API Key**: Your Azure OpenAI API key
- **API Base**: Your Azure OpenAI endpoint (e.g., ``https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/``)

OpenAI
~~~~~~

When prompted during deployment, select OpenAI and provide:

- **Models**: Comma-separated model names (e.g., ``gpt-5.4,gpt-5.1``)
- **API Key**: Your OpenAI API key
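Both providers take model names as a comma-separated string. As an illustration of how such input might be normalized, here is a minimal sketch. ``parse_models`` is a hypothetical helper, and the extension's real parsing may differ.

```python
from typing import List


def parse_models(raw: str) -> List[str]:
    """Split a comma-separated model string into a clean list of names.

    Surrounding whitespace is stripped and empty entries are dropped.
    """
    return [name.strip() for name in raw.split(",") if name.strip()]


if __name__ == "__main__":
    print(parse_models("gpt-5.4, gpt-5.1"))
```

Dropping empty entries means a trailing comma or stray spaces in the prompt input do not produce blank model names.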

Prerequisites
-------------

- Azure CLI installed
- An AKS cluster
- kubectl configured to access your cluster
- Appropriate permissions to deploy resources to your AKS cluster
- An LLM provider account (Azure OpenAI or OpenAI) with API access

Service Account Requirements
-----------------------------

SREClaw requires a Kubernetes service account with:

- Appropriate Role and RoleBinding in the target namespace
- For Azure resource access, the annotation ``azure.workload.identity/client-id: <managed-identity-client-id>``

Ensure you create these before running ``az aks claw create``.
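As a rough sketch of what these prerequisites look like, the script below builds a ServiceAccount and a RoleBinding as plain dictionaries and prints them as a JSON ``List`` that ``kubectl apply -f -`` can consume. The names ``sreclaw-sa`` and ``sreclaw-role`` are placeholders invented for this example, the Role is assumed to exist already, and the permissions SREClaw actually needs are not specified here.

```python
import json
from typing import Optional


def build_service_account(name: str, namespace: str, client_id: Optional[str] = None) -> dict:
    """Build a ServiceAccount manifest, optionally annotated for workload identity."""
    metadata = {"name": name, "namespace": namespace}
    if client_id:
        # Annotation key as given in the requirements above.
        metadata["annotations"] = {"azure.workload.identity/client-id": client_id}
    return {"apiVersion": "v1", "kind": "ServiceAccount", "metadata": metadata}


def build_role_binding(sa_name: str, namespace: str, role_name: str) -> dict:
    """Bind an existing Role (assumed to exist) to the service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{sa_name}-binding", "namespace": namespace},
        "subjects": [{"kind": "ServiceAccount", "name": sa_name, "namespace": namespace}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": role_name},
    }


if __name__ == "__main__":
    # Placeholder values; replace with your own names and client ID.
    items = [
        build_service_account("sreclaw-sa", "kube-system", "<managed-identity-client-id>"),
        build_role_binding("sreclaw-sa", "kube-system", "sreclaw-role"),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

The output can be piped to ``kubectl apply -f -``. Enter the same service account name when ``az aks claw create`` prompts for it.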

Troubleshooting
---------------

Check deployment status
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    az aks claw status --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

View pod logs
~~~~~~~~~~~~~

.. code-block:: bash

    kubectl logs -n kube-system -l app.kubernetes.io/name=aks-sreclaw

Verify helm release
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    helm list -n kube-system

Uninstall and reinstall
~~~~~~~~~~~~~~~~~~~~~~~~

If problems persist, delete the deployment and create it again:

.. code-block:: bash

    az aks claw delete --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system
    az aks claw create --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

Support
-------

For issues and feature requests, please visit:
https://github.com/Azure/azure-cli-extensions

License
-------

This extension is licensed under the MIT License. See LICENSE.txt for details.
45 changes: 45 additions & 0 deletions src/aks-sreclaw/azext_aks_sreclaw/__init__.py
@@ -0,0 +1,45 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

# pylint: disable=unused-import
import azext_aks_sreclaw._help
from azext_aks_sreclaw._client_factory import CUSTOM_MGMT_AKS
from azure.cli.core import AzCommandsLoader
from azure.cli.core.profiles import register_resource_type


def register_aks_sreclaw_resource_type():
    register_resource_type(
        "latest",
        CUSTOM_MGMT_AKS,
        None,
    )


class ContainerServiceCommandsLoader(AzCommandsLoader):

    def __init__(self, cli_ctx=None):
        from azure.cli.core.commands import CliCommandType
        register_aks_sreclaw_resource_type()

        aks_sreclaw_custom = CliCommandType(operations_tmpl='azext_aks_sreclaw.custom#{}')
        super().__init__(
            cli_ctx=cli_ctx,
            custom_command_type=aks_sreclaw_custom,
        )

    def load_command_table(self, args):
        super().load_command_table(args)
        from azext_aks_sreclaw.commands import load_command_table
        load_command_table(self, args)
        return self.command_table

    def load_arguments(self, command):
        super().load_arguments(command)
        from azext_aks_sreclaw._params import load_arguments
        load_arguments(self, command)


COMMAND_LOADER_CLS = ContainerServiceCommandsLoader
23 changes: 23 additions & 0 deletions src/aks-sreclaw/azext_aks_sreclaw/_client_factory.py
@@ -0,0 +1,23 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from azure.cli.core.commands.client_factory import get_mgmt_service_client
from azure.cli.core.profiles import CustomResourceType

CUSTOM_MGMT_AKS = CustomResourceType('azext_aks_sreclaw.vendored_sdks.azure_mgmt_containerservice.2025_10_01',
                                     'ContainerServiceClient')

# Note: a cf_xxx function is passed as the client_factory option of a command group at command declaration and
# should ignore parameters other than cli_ctx. A get_xxx_client function is used inside command implementations to
# obtain clients for other services; it usually accepts subscription_id so the request can target that subscription.


# container service clients
def get_container_service_client(cli_ctx, subscription_id=None):
    return get_mgmt_service_client(cli_ctx, CUSTOM_MGMT_AKS, subscription_id=subscription_id)


def cf_managed_clusters(cli_ctx, *_):
    return get_container_service_client(cli_ctx).managed_clusters
29 changes: 29 additions & 0 deletions src/aks-sreclaw/azext_aks_sreclaw/_consts.py
@@ -0,0 +1,29 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

import os

# Configuration paths
home_dir = os.path.expanduser("~")

AGENT_NAMESPACE = "kube-system"
AKS_SRECLAW_LABEL_SELECTOR = "app.kubernetes.io/name=aks-sreclaw"

# Kubernetes WebSocket exec protocol constants
RESIZE_CHANNEL = 4 # WebSocket channel for terminal resize messages
# WebSocket heartbeat configuration (matching kubectl client-go)
# Based on kubernetes/client-go/tools/remotecommand/websocket.go#L59-L65
# pingPeriod = 5 * time.Second
# pingReadDeadline = (pingPeriod * 12) + (1 * time.Second)
# The read deadline is calculated to allow up to 12 missed pings plus 1 second buffer
# This provides tolerance for network delays while detecting actual connection failures
HEARTBEAT_INTERVAL = 5.0 # pingPeriod: 5 seconds between pings
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1 # pingReadDeadline: 61 seconds total timeout

# AKS SREClaw Version (shared by helm chart and docker image)
AKS_SRECLAW_VERSION = "0.0.0"

# Helm Configuration
HELM_VERSION = "3.16.0"
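
The heartbeat constants in ``_consts.py`` encode a read deadline of 12 ping periods plus a 1-second buffer. The sketch below shows how such a deadline could be checked. The two constants are copied from the file above, while ``connection_timed_out`` is a hypothetical helper written for this example and is not part of the diff.

```python
HEARTBEAT_INTERVAL = 5.0  # pingPeriod: seconds between pings
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1  # pingReadDeadline: 61 seconds


def connection_timed_out(last_pong: float, now: float) -> bool:
    """Return True if more than HEARTBEAT_TIMEOUT seconds passed since the last pong.

    Both arguments are timestamps in seconds from the same monotonic clock.
    """
    return (now - last_pong) > HEARTBEAT_TIMEOUT
```

With a 5-second interval the deadline works out to 61 seconds, so a connection that misses up to 12 consecutive pings is still treated as alive.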