DevOps Screening Challenge

Overview

This challenge runs entirely on your local machine using Docker and KIND (Kubernetes in Docker). The Terraform stack will install dependencies, create a local KIND cluster, and deploy broken workloads for you to debug.

Prerequisites

OS: Ubuntu/Debian-based Linux (scripts use apt-get)
Docker: Will be installed by the setup scripts if not present
Terraform: v0.13+ (install locally)
sudo access: Required for installing packages and tools

Setup

Clone this repository:

git clone git@github.com:sanjay-fiftyfive/devops-hiring-assignment.git
cd devops-hiring-assignment

Deploy the environment using Terraform:
```
cd terraform
terraform init
terraform apply
```
This will:
- Install Docker, kubectl, and KIND locally
- Create a 4-node KIND cluster (1 control-plane + 3 workers)
- Deploy all challenge workloads
- Apply sabotage to create the debugging challenges

Verify the cluster is running:

kubectl --context kind-sanjay-challenge get nodes
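
For a quicker read of the same output, the node list can be collapsed into a count per status. This is only a sketch: the function name `node_summary` is hypothetical, and it assumes `kubectl` and `awk` are on your PATH and that the second column of `get nodes` is the status.

```shell
# Count nodes per STATUS value for the challenge cluster.
# Assumption: column 2 of `kubectl get nodes --no-headers` is the node status.
node_summary() {
  kubectl --context kind-sanjay-challenge get nodes --no-headers |
    awk '{ count[$2]++ } END { for (s in count) printf "%s=%d\n", s, count[s] }' |
    sort
}
```

Running `node_summary` prints one `STATUS=count` line per distinct status. Since Challenge 4 describes a worker that has gone NotReady, a non-Ready node here does not by itself mean the setup failed.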

Rules

You MAY NOT modify the existing Terraform code or Kubernetes manifests that deploy the initial cluster, with one exception: if the initial deployment fails (see Challenge 1), you may debug and fix whatever prevents it from completing.
You may install anything you need on your local machine.
Document your work — for each challenge, create a file solution-N.md describing:
- What symptoms you observed
- What tools you used to investigate
- What the root cause was and how you confirmed it
- What you did to fix it
- How you verified the fix
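
One way to start the write-ups is to generate empty skeletons up front. The sketch below is an assumption about layout, not a required format: the function name `make_solution_templates` and the markdown headings are illustrative, and only the `solution-N.md` file names and the five topics come from the list above.

```shell
# Create solution-1.md through solution-6.md in the current directory.
# Files that already exist are left untouched.
make_solution_templates() {
  for n in 1 2 3 4 5 6; do
    file="solution-${n}.md"
    if [ -e "$file" ]; then
      continue
    fi
    cat > "$file" <<EOF
# Challenge ${n}

## Symptoms observed

## Tools used to investigate

## Root cause and how it was confirmed

## Fix applied

## Verification
EOF
  done
}
```

Call `make_solution_templates` once from the directory where you keep your notes, then fill in each section as you work.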

Time Limit

Total: 4 hours
Suggested pace: 20 / 30 / 40 / 45 / 45 / 50 minutes for Challenges 1 through 6 respectively (230 minutes, leaving about 10 minutes of buffer)
Partial solutions are valued — show your debugging process

Challenges

Challenge 1: Deploy the Cluster

Deploy the Terraform stack in the terraform/ directory to create a KIND cluster on your local machine.

The Terraform code has issues that will prevent a successful deployment. Debug and fix them to get the cluster running.

After deployment you should have:

A 4-node KIND cluster (1 control-plane + 3 workers)
kubectl access via context kind-sanjay-challenge

Verify: kubectl --context kind-sanjay-challenge get nodes

Hint: Start with terraform init and work through the errors one at a time. There are multiple issues across different Terraform concepts.

Challenge 2: Fix the Broken Deployment

In namespace t2, the deployment task-2 wants 3 healthy replicas but all pods are failing.

Goal: Get all 3 replicas of task-2 running and ready.

Hint: There are multiple issues. The first fix won't be the last.
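
A first pass over a failing deployment usually means looking at the deployment, its pods, and recent events together. The sketch below wraps those read-only commands; the helper names `k` and `triage_deployment` are hypothetical, and nothing here assumes what the actual faults are.

```shell
# Thin wrapper so every call targets the challenge cluster.
k() { kubectl --context kind-sanjay-challenge "$@"; }

# Print deployment status, pod status, the deployment description,
# and namespace events ordered by time. Usage: triage_deployment <ns> <deployment>
triage_deployment() {
  ns="$1"
  deploy="$2"
  k -n "$ns" get deployment "$deploy" -o wide
  k -n "$ns" get pods -o wide
  k -n "$ns" describe deployment "$deploy"
  k -n "$ns" get events --sort-by=.lastTimestamp
}
```

For this challenge the call would be `triage_deployment t2 task-2`. Rerun it after each fix, since the hint says the first problem you resolve will expose another.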

Challenge 3: Network Black Hole

In namespace t3, there is a deployment task-3 running a standard nginx server with a service exposing port 80.

In the default namespace, a pod debug-client (with full networking tools) has been deployed.

Goal: From inside debug-client, successfully run:

curl http://task-3.t3.svc.cluster.local

It should return the nginx welcome page.

Hint: There are multiple layers blocking connectivity. The obvious one isn't the only one.
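
Because several layers may be involved, it can help to check them in a fixed order: service and endpoints, network policies, DNS, then HTTP. The following is a sketch under assumptions: the helper names are hypothetical, and it assumes `nslookup` and `curl` are among the networking tools available in debug-client.

```shell
# Thin wrapper so every call targets the challenge cluster.
k() { kubectl --context kind-sanjay-challenge "$@"; }

# Walk the path from debug-client to the task-3 service, one layer at a time.
check_task3_path() {
  target="task-3.t3.svc.cluster.local"
  echo "== Service and endpoints =="
  k -n t3 get service task-3 -o wide
  k -n t3 get endpoints task-3
  echo "== Network policies =="
  k get networkpolicy --all-namespaces
  echo "== DNS from debug-client =="
  k -n default exec debug-client -- nslookup "$target"
  echo "== HTTP from debug-client =="
  k -n default exec debug-client -- curl -sS --max-time 5 "http://$target"
}
```

The first section that returns something unexpected is a reasonable place to start digging. The checks are read-only, so `check_task3_path` is safe to rerun after every change.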

Challenge 4: Node Recovery

Node sanjay-challenge-worker2 has gone NotReady.

Goal: Bring the node back to Ready status and ensure it can schedule and run pods.

Hint: You'll need to get inside the node's container to debug. The node runs as a Docker container — use docker exec to access it. The issue is not a single problem.
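
As a starting point for looking inside the node, the sketch below runs a few read-only checks through `docker exec`. It assumes the Docker container is named after the node and that the kubelet runs as a systemd unit called `kubelet`; the function name `inspect_node` is hypothetical.

```shell
# Show kubelet service state, its recent logs, and disk usage inside a node container.
# Usage: inspect_node [container-name]   (defaults to the node from this challenge)
inspect_node() {
  node="${1:-sanjay-challenge-worker2}"
  docker exec "$node" systemctl status kubelet --no-pager
  docker exec "$node" journalctl -u kubelet --no-pager -n 50
  docker exec "$node" df -h
}
```

For an interactive session, `docker exec -it sanjay-challenge-worker2 bash` opens a shell in the same container, assuming `bash` is present in the node image.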

Challenge 5: TLS Certificate Debugging

In namespace t5, there is a deployment secure-app running nginx configured for HTTPS on port 443, and a pod tls-client with curl installed.

A CA bundle is mounted at /etc/ssl/custom/ca.crt inside the tls-client pod.

Goal: Make this command succeed without using --insecure or -k:

kubectl exec tls-client -n t5 -- curl --cacert /etc/ssl/custom/ca.crt https://secure-app.t5.svc.cluster.local

It must return: TLS Challenge Complete!

Hint: There are multiple certificate-related issues. The server may not even start initially.
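
Most certificate problems show up in a handful of fields: subject, issuer, validity dates, and subject alternative names. The sketch below inspects PEM files you have already copied to your machine; the function names and file names are hypothetical, and it assumes `openssl` is installed locally.

```shell
# Print subject, issuer, validity window, and any subjectAltName entries of a PEM certificate.
cert_summary() {
  openssl x509 -in "$1" -noout -subject -issuer -dates
  openssl x509 -in "$1" -noout -text |
    grep -A1 'Subject Alternative Name' || echo "no subjectAltName extension"
}

# Check whether a certificate chains to a given CA bundle.
# Usage: cert_chains_to <ca.crt> <server.crt>
cert_chains_to() {
  openssl verify -CAfile "$1" "$2"
}
```

The CA bundle can be copied out of the client pod with `kubectl exec tls-client -n t5 -- cat /etc/ssl/custom/ca.crt > ca.crt`. Where the server certificate lives is something you will need to find in the cluster; once you have it as a file, `cert_summary server.crt` and `cert_chains_to ca.crt server.crt` cover the usual suspects.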

Challenge 6: Performance Triage Under Load

In namespace t6, the deployment api-server is experiencing intermittent failures. A load-generator pod is sending continuous traffic to the service.

Rules:

Do NOT stop or delete the load-generator pod
Do NOT reduce the load generator's request rate

Goal: Make the api-server handle all requests successfully with response times under 2 seconds. An HPA is configured but isn't working.

Hint: There are resource constraints at multiple levels. Check what's limiting scaling.
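
To see what might be limiting scaling, it helps to view the HPA, namespace-level limits, and the container resource settings side by side. This is a sketch: the helper names are hypothetical, and `kubectl top` only returns data if a metrics source such as metrics-server is working in the cluster.

```shell
# Thin wrapper so every call targets namespace t6 on the challenge cluster.
k() { kubectl --context kind-sanjay-challenge -n t6 "$@"; }

# Collect a read-only snapshot of everything that influences autoscaling.
scaling_snapshot() {
  k get hpa
  k describe hpa
  k get resourcequota,limitrange
  k get deployment api-server -o jsonpath='{.spec.template.spec.containers[*].resources}'
  echo
  k top pods
}
```

Running `scaling_snapshot` before and after each change makes it easier to confirm whether the HPA has started reporting metrics and adding replicas.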

Tools You Should Know

| Tool | Usage |
|------|-------|
| `kubectl` | Cluster interaction, pod debugging, log reading |
| `docker exec` | Access KIND node containers |
| `systemctl` / `journalctl` | Linux service management inside nodes |
| `tcpdump` / `nslookup` / `dig` | Network and DNS debugging |
| `openssl` | TLS certificate inspection |
| `iptables` | Firewall rule inspection |
| `strace` | System call tracing |
| `df` / `htop` / `top` | Resource monitoring |

Cleanup

To tear down the entire environment:

kind delete cluster --name sanjay-challenge

Or to destroy everything Terraform created:

cd terraform
terraform destroy

Submission

When complete:

Ensure all fixes are applied on the cluster
Ensure your solution-N.md files describe your debugging process
Record a Loom video (15-20 minutes) walking through your solutions:
- Briefly explain each challenge and what you found
- Show the fix in action (e.g., run the verification commands live)
- Highlight your debugging approach: what tools you used and why
Share the Loom link along with your solution files

Note: Keep your local environment running after submission. In the next round, you will be asked to perform additional tasks on the same cluster in a live session.
