66 commits
984bfb2
initial update
engineeredcurlz Sep 26, 2025
aeab336
Merge branch 'main' into test-refactor
lokesh-keyan Oct 15, 2025
032f59b
wip: add create_deployment function to crud
lokesh-keyan Oct 16, 2025
bb4bca8
add import for handle_worload_operation function
Nov 5, 2025
47caf3c
add test for success
Nov 5, 2025
c435e36
change operation name
Nov 5, 2025
11be2fc
update operation name in test
Nov 5, 2025
5e73464
add test for failure
Nov 5, 2025
8530049
add exception test
Nov 5, 2025
b724860
Merge branch 'main' into test-refactor
engineeredcurlz Nov 5, 2025
4e7a73b
Linting error: removed elif and else
Nov 5, 2025
69ce1e1
Merge branch 'test-refactor' of https://github.com/Azure/telescope in…
Nov 5, 2025
7a8dad4
fixed the spacing
Nov 5, 2025
5e52b71
removed extra spaces
Nov 5, 2025
364c264
Add deployment_name for consistency and to reference later
Nov 6, 2025
0bc8275
verify deployment using wait condition
Nov 6, 2025
6604ac0
Add logging for maniest and to wait for deployment - debug
Nov 6, 2025
3e6beea
add logger for deployment success
Nov 6, 2025
ae99b54
verify pods are available in deployment
Nov 6, 2025
6623cb2
add failure count
Nov 6, 2025
8916a99
add logger to verify deployment
Nov 6, 2025
a4273ac
add unit test for create_deployment method
Nov 6, 2025
bf6143a
ran lint
Nov 6, 2025
712939e
Add test for deployment partial sucess
engineeredcurlz Mar 2, 2026
b6f248c
Add test for multiple deployments
engineeredcurlz Mar 2, 2026
e0d8037
Add test for progressive scaling failure
engineeredcurlz Mar 2, 2026
e6feced
Add test in node_pool_crud for returns false early exit
engineeredcurlz Mar 2, 2026
e3d142d
Add test in node_pool_crud for scale up fails but continues to scale …
engineeredcurlz Mar 2, 2026
769573e
Add test for node_pool_crud for scale down fails operation continues
engineeredcurlz Mar 2, 2026
ce8197e
Add test in node_pool_crud for deployment partial success
engineeredcurlz Mar 5, 2026
be9af74
pipeline test
engineeredcurlz Mar 16, 2026
af59e12
linting
engineeredcurlz Mar 16, 2026
ba865d5
yaml lint
engineeredcurlz Mar 16, 2026
1a348f9
add python security dependency
engineeredcurlz Mar 16, 2026
193af95
fix dependency
engineeredcurlz Mar 16, 2026
d433f19
Merge branch 'main' into test-refactor
engineeredcurlz Mar 16, 2026
7598c61
update
engineeredcurlz Mar 16, 2026
a006d4c
added matrix variables to pipeline
engineeredcurlz Mar 16, 2026
d1f5ebb
testing: set GPU_NODE_POOL to empty string
engineeredcurlz Mar 16, 2026
2ca536b
update vm size
engineeredcurlz Mar 16, 2026
8a96356
testing change vm size with available quota
engineeredcurlz Mar 16, 2026
a65546e
update node count + vm size
engineeredcurlz Mar 16, 2026
0c62bca
update: topology selection
engineeredcurlz Mar 16, 2026
2279753
add deployment step after scale-up operation
engineeredcurlz Mar 17, 2026
1028864
wire deployment parameters through k8s-crud-gpu topology
engineeredcurlz Mar 17, 2026
3ca87d7
correct deployment command routing and kwargs in handle_workload_oper…
engineeredcurlz Mar 17, 2026
8ceb256
correct topology name and add deployment matrix variables
engineeredcurlz Mar 17, 2026
b8f876e
update handle_workload_operations tests to match deployment command
engineeredcurlz Mar 17, 2026
767f0cc
fix yamllint and pylint warnings
engineeredcurlz Mar 17, 2026
d9ac29f
add correct indentation
engineeredcurlz Mar 17, 2026
e6ccf1f
iterate multi-doc YAML generator when applying deployment manifests
engineeredcurlz Mar 17, 2026
eae0409
refactor: seperate deploy workloads into its own pipelinee step
engineeredcurlz Mar 18, 2026
3a6a9b0
fix: execute k8s workload operations displayname
engineeredcurlz Mar 18, 2026
7a79129
fix: prevent infinite loop in azure node pool deployment tests
engineeredcurlz Mar 18, 2026
d0b6578
Merge branch 'main' into test-refactor
engineeredcurlz Mar 18, 2026
3fa6295
fix: await Azure LRO poller to prevent scale race condition
engineeredcurlz Mar 18, 2026
6b7a448
fix: replace hardcoded timeout with self.step_timeout in create_deplo…
engineeredcurlz Mar 26, 2026
8e3445c
refactor: convert f-string logger calls to %-style in create_deployment
engineeredcurlz Mar 26, 2026
2aac24d
feat: make label_selector derive from parameter
engineeredcurlz Mar 26, 2026
7522f1a
feat: remove hardcoding add namespace parameter
engineeredcurlz Mar 26, 2026
65fdd00
fix: remove --deployment-name CLI
engineeredcurlz Mar 26, 2026
b0be1b1
fix: use hyphen for --number-of-deployments
engineeredcurlz Mar 26, 2026
c5d01be
fix: return error on unknown workload command
engineeredcurlz Mar 26, 2026
946dea9
revert: restore original docstring line wrapping
engineeredcurlz Mar 26, 2026
e349428
Merge branch 'main' into test-refactor
engineeredcurlz Mar 26, 2026
1ed3985
Merge branch 'main' into test-refactor
engineeredcurlz Apr 1, 2026
6 changes: 4 additions & 2 deletions modules/python/clients/aks_client.py
@@ -464,12 +464,13 @@ def scale_node_pool(
         node_pool.count = node_count

         logger.info(f"Scaling node pool {node_pool_name} to {node_count} nodes")
-        self.aks_client.agent_pools.begin_create_or_update(
+        poller = self.aks_client.agent_pools.begin_create_or_update(
             resource_group_name=self.resource_group,
             resource_name=cluster_name,
             agent_pool_name=node_pool_name,
             parameters=node_pool,
         )
+        poller.result()  # Wait for Azure control plane to finish before proceeding

         logger.info(
             f"Waiting for {node_count} nodes in pool {node_pool_name} to be ready..."
@@ -676,12 +677,13 @@ def _progressive_scale(
                 "cluster_info", self.get_cluster_data(cluster_name)
             )
             node_pool.count = step  # Update node count in the node pool object
-            result = self.aks_client.agent_pools.begin_create_or_update(
+            poller = self.aks_client.agent_pools.begin_create_or_update(
                 resource_group_name=self.resource_group,
                 resource_name=cluster_name,
                 agent_pool_name=node_pool_name,
                 parameters=node_pool,
             )
+            result = poller.result()  # Wait for Azure control plane to finish before proceeding

             # Use agentpool=node_pool_name as default label if not specified
             label_selector = f"agentpool={node_pool_name}"

Contributor: Why did we add poller here? We do not need it.

Author: While running the pipeline I was getting the failure "OperationNotAllowed: Operation is not allowed because there's an in progress scale node pool operation": the next operation was trying to move forward while the prior one was still in progress. begin_create_or_update was already returning a poller, but it was being discarded. Adding poller.result (line 473) enforces a wait so the prior operation can finish before moving on to the next, thereby fixing the failure I was receiving.
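The fix above hinges on Azure's long-running-operation (LRO) pattern: `begin_*` SDK calls return immediately with a poller, and only `poller.result()` blocks until the control plane finishes. A minimal stand-in sketch of that wait semantics (`FakePoller` is hypothetical; the real object is an `azure.core.polling.LROPoller`):

```python
class FakePoller:
    """Stand-in that mimics the blocking .result() of an Azure LROPoller."""

    def __init__(self, outcome):
        self._outcome = outcome

    def result(self):
        # Blocks until the long-running operation completes, then returns its outcome.
        return self._outcome


def scale_node_pool(begin_scale):
    # begin_* returns immediately with a poller; discarding it lets the next
    # scale request race the one still in flight (the OperationNotAllowed error).
    poller = begin_scale()
    return poller.result()  # enforce the wait before the next operation starts


agent_pool = scale_node_pool(lambda: FakePoller({"count": 5}))
```

Firing a second `begin_create_or_update` without this wait is exactly what produced the in-progress-operation failure described in the thread.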
119 changes: 119 additions & 0 deletions modules/python/crud/azure/node_pool_crud.py
@@ -8,6 +8,7 @@

 import logging
 import time
+import yaml

 from clients.aks_client import AKSClient
 from utils.logger_config import get_logger, setup_logging
@@ -270,3 +271,121 @@ def all(
                 logger.error(error_msg)
                 errors.append(error_msg)
         return False

    def create_deployment(
        self,
        node_pool_name,
        replicas=10,
        manifest_dir=None,
        number_of_deployments=1,
        label_selector="app=nginx-container",
        namespace="default"
    ):
        """
        Create Kubernetes deployments after node pool operations.

        Args:
            node_pool_name: Name of the node pool to target
            replicas: Number of replicas per deployment (default: 10)
            manifest_dir: Directory containing Kubernetes manifest files
            number_of_deployments: Number of deployments to create (default: 1)
            label_selector: Label selector for the deployment's pods (default: "app=nginx-container")
            namespace: Kubernetes namespace (default: "default")

        Returns:
            True if all deployment creations were successful, False otherwise
        """
        logger.info("Creating %d deployment(s)", number_of_deployments)
        logger.info("Target node pool: %s", node_pool_name)
        logger.info("Replicas per deployment: %d", replicas)
        logger.info("Using manifest directory: %s", manifest_dir)

        try:
            # Get Kubernetes client from AKS client
            k8s_client = self.aks_client.k8s_client

            if not k8s_client:
                logger.error("Kubernetes client not available")
                return False

            successful_deployments = 0

            # Loop through number of deployments
            for deployment_index in range(1, number_of_deployments + 1):
                logger.info("Creating deployment %d/%d", deployment_index, number_of_deployments)

                try:
                    if manifest_dir:
                        # Use the template path from manifest_dir
                        template_path = f"{manifest_dir}/deployment.yml"
                    else:
                        # Use default template path
                        template_path = "modules/python/crud/workload_templates/deployment.yml"

                    # Generate deployment name
                    deployment_name = f"myapp-{node_pool_name}-{deployment_index}"

                    # Create deployment template using k8s_client.create_template
                    deployment_template = k8s_client.create_template(
                        template_path,
                        {
                            "DEPLOYMENT_REPLICAS": replicas,
                            "NODE_POOL_NAME": node_pool_name,
                            "INDEX": deployment_index,
                            "LABEL_VALUE": label_selector.split("=", 1)[-1],
                        }
                    )

                    # Apply each document in the rendered multi-doc template
                    for doc in yaml.safe_load_all(deployment_template):
                        if doc:
                            k8s_client.apply_manifest_from_file(manifest_dict=doc)

                    logger.info("Applied manifest for deployment %s", deployment_name)

                    # Wait for deployment to be available (successful deployment verification)
                    logger.info("Waiting for deployment %s to become available...", deployment_name)
                    deployment_ready = k8s_client.wait_for_condition(
                        resource_type="deployment",
                        wait_condition_type="available",
                        resource_name=deployment_name,
                        namespace=namespace,
                        timeout_seconds=self.step_timeout
                    )

                    if deployment_ready:
                        logger.info("Deployment %s is successfully available", deployment_name)

                        # Additionally wait for pods to be ready
                        logger.info("Waiting for pods of deployment %s to be ready...", deployment_name)
                        k8s_client.wait_for_pods_ready(
                            operation_timeout_in_minutes=5,
                            namespace=namespace,
                            pod_count=replicas,
                            label_selector=label_selector
                        )

                        logger.info("Successfully created and verified deployment %d", deployment_index)
                        successful_deployments += 1
                    else:
                        logger.error("Deployment %s failed to become available within timeout", deployment_name)
                        continue

                except Exception as e:
                    logger.error("Failed to create deployment %d: %s", deployment_index, e)
                    # Continue with next deployment instead of failing completely
                    continue

            # Check if all deployments were successful
            if successful_deployments == number_of_deployments:
                logger.info("Successfully created all %d deployment(s)", number_of_deployments)
                return True
            if successful_deployments > 0:
                logger.warning("Created %d/%d deployment(s)", successful_deployments, number_of_deployments)
                return False
            logger.error("Failed to create any deployments")
            return False

        except Exception as e:
            logger.error("Failed to create deployments: %s", e)
            return False
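The apply loop above relies on `yaml.safe_load_all` yielding one parsed document per `---`-separated block, with empty documents coming back as `None` (hence the `if doc:` guard). A small self-contained illustration, using a made-up manifest string:

```python
import yaml

# Two-document manifest, analogous to the rendered deployment.yml template.
manifest = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-pool-1
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-pool-1
"""

applied = []
for doc in yaml.safe_load_all(manifest):  # generator: one dict per YAML document
    if doc:                               # empty documents parse to None; skip them
        applied.append(doc["kind"])
```

Iterating the generator (rather than calling `safe_load`, which rejects multi-document input) is what the "iterate multi-doc YAML generator" commit refers to.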
52 changes: 52 additions & 0 deletions modules/python/crud/main.py
@@ -146,6 +146,33 @@ def handle_node_pool_operation(node_pool_crud, args):
         logger.error(f"Error during '{command}' operation: {str(e)}")
         return 1

 def handle_workload_operations(node_pool_crud, args):
     """Handle workload operations (deployment, statefulset, jobs) based on the command"""
     command = args.command
     result = None

     try:
         if command == "deployment":
             # Prepare deploy arguments
             deploy_kwargs = {
                 "node_pool_name": args.node_pool_name,
                 "replicas": args.replicas,
                 "manifest_dir": args.manifest_dir,
                 "number_of_deployments": args.number_of_deployments
             }

             result = node_pool_crud.create_deployment(**deploy_kwargs)
         else:
             logger.error("Unknown workload command: '%s'", command)
             return 1
         # Check if the operation was successful
         if result is False:
             logger.error(f"Operation '{command}' failed")
             return 1
         return 0
     except Exception as e:
         logger.error(f"Error during '{command}' operation: {str(e)}")
         return 1

Contributor: add else here

    else:
        logger.error("Unknown workload command: '%s'", command)
        return 1

Author: I have added this to stop false successes and return logged errors.

 def handle_node_pool_all(node_pool_crud, args):
     """Handle the all-in-one node pool operation command (create, scale up, scale down, delete)"""
@@ -320,6 +347,31 @@ def main():
     )
     all_parser.set_defaults(func=handle_node_pool_operation)

     # Deployment command - add after the "all" command parser
     deployment_parser = subparsers.add_parser(
         "deployment", parents=[common_parser], help="create deployments"
     )
     deployment_parser.add_argument("--node-pool-name", required=True, help="Node pool name")
     deployment_parser.add_argument(
         "--number-of-deployments",
         type=int,
         default=1,
         help="Number of deployments"
     )
     deployment_parser.add_argument(
         "--replicas",
         type=int,
         default=10,
         help="Number of deployment replicas"
     )
     deployment_parser.add_argument(
         "--manifest-dir",
         required=True,
         help="Directory containing Kubernetes manifest files for the deployment"
     )

     deployment_parser.set_defaults(func=handle_workload_operations)

     # Arguments provided, run node pool operations and collect benchmark results
     try:
         args = parser.parse_args()
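Based on the parser wiring above, the new subcommand can be exercised as in the following sketch, which mirrors (rather than imports) the real `main.py` setup so the defaults and required flags are visible in isolation:

```python
import argparse

# Hypothetical mirror of the "deployment" subparser added in main.py;
# the real parser also attaches a common parent parser and handler functions.
parser = argparse.ArgumentParser(prog="crud")
sub = parser.add_subparsers(dest="command")
dep = sub.add_parser("deployment", help="create deployments")
dep.add_argument("--node-pool-name", required=True, help="Node pool name")
dep.add_argument("--number-of-deployments", type=int, default=1)
dep.add_argument("--replicas", type=int, default=10)
dep.add_argument("--manifest-dir", required=True)

# Omitting --number-of-deployments and --replicas exercises their defaults.
args = parser.parse_args([
    "deployment",
    "--node-pool-name", "gpupool",
    "--manifest-dir", "modules/python/crud/workload_templates",
])
```

Note that argparse converts the hyphenated `--number-of-deployments` flag into the attribute `args.number_of_deployments`, which is why the hyphenated spelling fixed in commit b0be1b1 still matches the `deploy_kwargs` key.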
34 changes: 34 additions & 0 deletions modules/python/crud/workload_templates/deployment.yml
@@ -0,0 +1,34 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-{{NODE_POOL_NAME}}-{{INDEX}}
  labels:
    app: {{LABEL_VALUE}}
spec:
  template:
    metadata:
      name:
      labels:
        app: {{LABEL_VALUE}}
    spec:
      containers:
        - name: {{LABEL_VALUE}}
          image: mcr.microsoft.com/oss/nginx/nginx:1.21.6
          ports:
            - containerPort: 80
  replicas: {{DEPLOYMENT_REPLICAS}}
  selector:
    matchLabels:
      app: {{LABEL_VALUE}}
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-{{NODE_POOL_NAME}}-{{INDEX}}
spec:
  ports:
    - port: 80
      name: myapp
  clusterIP: None
  selector:
    app: {{LABEL_VALUE}}
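The `{{PLACEHOLDER}}` tokens in this template are filled in by `k8s_client.create_template` with the values passed from `create_deployment`. A plausible sketch of that substitution is plain string replacement; the real implementation may differ:

```python
def render(template: str, values: dict) -> str:
    """Replace each {{KEY}} token in the template with its value (hypothetical
    stand-in for k8s_client.create_template's substitution step)."""
    for key, val in values.items():
        template = template.replace("{{" + key + "}}", str(val))
    return template


# e.g. the metadata.name line of the Deployment above:
name = render(
    "myapp-{{NODE_POOL_NAME}}-{{INDEX}}",
    {"NODE_POOL_NAME": "gpupool", "INDEX": 2},
)
```

This is also why `create_deployment` passes `LABEL_VALUE` as the right-hand side of the label selector (`label_selector.split("=", 1)[-1]`): the template uses the bare value, not the full `key=value` pair.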
2 changes: 1 addition & 1 deletion modules/python/requirements.txt
@@ -12,4 +12,4 @@ coverage==7.6.12
semver==3.0.4
requests==2.32.4
pyyaml==6.0.2
-pyOpenSSL==24.0.0
+pyopenssl>=24.0.0