165 changes: 165 additions & 0 deletions docs/docs/how-tos/debug-deployment.mdx
@@ -0,0 +1,165 @@
---
title: Debug a deployment
slug: /how-tos/debug-deployment
description: How to diagnose a stuck or failing NKP deployment — covering both the nic deploy phase and post-deploy cluster health.
---

This guide covers debugging a stuck or failing NKP deployment. It is scoped to NKP only — for Classic Nebari, see [Troubleshooting](/classic/troubleshooting).

:::note
The steps below that use `kubectl` require the CLI installed and `KUBECONFIG` set. See [Deploy a cluster](/docs/how-tos/deploy-cluster#retrieve-the-kubeconfig) for how to retrieve the kubeconfig.
:::

## `nic deploy` phase

### Reading `nic` output

`nic deploy` streams structured progress to stdout. When a deploy fails, the error appears at the end of that output. Capture it for sharing or offline review:

```bash
nic deploy -f config.yaml 2>&1 | tee deploy.log
```
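
Once the log is captured, a small helper can pull out the interesting parts. The function below is a minimal sketch, not part of `nic`: the name `summarize_deploy_log` is hypothetical, and it assumes failure lines contain the word `error`, which is a guess about the output format rather than documented behavior.

```bash
# Hypothetical helper: show error-looking lines and the tail of a captured deploy log.
# Assumes failure lines contain "error" (case-insensitive); adjust the pattern to your output.
summarize_deploy_log() {
  local log="${1:-deploy.log}"
  echo "== lines mentioning 'error' =="
  grep -in 'error' "$log" || echo "(none found)"
  echo "== last 20 lines =="
  tail -n 20 "$log"
}
```

Call it as `summarize_deploy_log deploy.log`. The tail is printed as well because the failing error appears at the end of the `nic deploy` output.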

For deeper traces — showing exactly which internal step hung — enable the built-in OpenTelemetry console exporter (`OTEL_EXPORTER` is a `nic`-specific variable; valid values are `none`, `console`, `otlp`, and `both`):

```bash
OTEL_EXPORTER=console nic deploy -f config.yaml
```

This emits span data for every internal step, including provider API calls and ArgoCD bootstrap.

### Common `nic deploy` failures

| Symptom | Cause | Fix |
|---|---|---|
| Exits immediately with a config or credential error | Pre-flight validation failed | Run `nic validate -f config.yaml` and fix the reported fields before re-deploying |
| `context deadline exceeded` | Large cluster or slow network | Re-run with `--timeout 1h` |
| Cloud API error (quota exceeded, insufficient permissions) | Provider-specific | Check your cloud provider's console; fix IAM permissions or request a quota increase, then re-run (`nic deploy` is idempotent) |
| Hangs after "ArgoCD installed" | ArgoCD can't reach the GitOps repository | Verify `GIT_TOKEN` has read/write access to the repo, then check ArgoCD server logs (see [ArgoCD applications](#argocd-applications) below) |

## Post-deploy phase

After `nic deploy` returns, the cluster still needs ArgoCD to sync and start all foundational services. This section covers debugging that process in dependency order.

### ArgoCD applications

Start here — ArgoCD is the engine that delivers everything else.

List all applications and their sync and health status:

```bash
kubectl get applications -n argocd
```
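
On a cluster with many applications, it can help to hide the healthy rows. The filter below is a sketch under one assumption: that the table output has the columns `NAME`, `SYNC STATUS`, and `HEALTH STATUS` in that order. The function name `unhealthy_apps` is made up for this example.

```bash
# Hypothetical helper: print only applications that are not both Synced and Healthy.
# Reads `kubectl get applications` table output on stdin; assumes the columns are
# NAME, SYNC STATUS, HEALTH STATUS and that the values contain no spaces.
unhealthy_apps() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") { print $1, $2, $3 }'
}

# Usage (requires cluster access):
#   kubectl get applications -n argocd | unhealthy_apps
```

Empty output means every application is both `Synced` and `Healthy`.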

Inspect a stuck or degraded application:

```bash
kubectl describe application <name> -n argocd
```

What to look for:

- **`SyncStatus`**: `Synced` (good), `OutOfSync` (hasn't applied the latest manifests), `Unknown` (can't reach the source repo — check the ArgoCD server logs).
- **`Health`**: `Healthy` (good), `Progressing` (still starting up), `Degraded` (something is wrong).
- A `Degraded` application includes a human-readable message in `Status.Conditions` — read it before looking elsewhere.

Check ArgoCD server logs directly:

```bash
kubectl logs -n argocd -l app.kubernetes.io/component=server --tail=100
```

Common causes of stuck ArgoCD applications: `GIT_TOKEN` has expired or lacks access to the repo, or the cluster can't pull images from the OCI registry.

### Foundational software pods

Each foundational component runs in its own namespace. Check each namespace below for pods that are not `Running`:

| Component | Namespace |
|---|---|
| cert-manager | `cert-manager` |
| Envoy Gateway | `envoy-gateway-system` |
| Keycloak | `keycloak` |
| OpenTelemetry Collector | `monitoring` |

General pattern for any component:

```bash
kubectl get pods -n <namespace>
kubectl logs -n <namespace> <pod-name>
```
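
To scan every namespace at once, the table output can be filtered on its `STATUS` column. This is a sketch rather than an official tool: `problem_pods` is a hypothetical name, and it assumes the default `kubectl get pods -A` column order (`NAMESPACE`, `NAME`, `READY`, `STATUS`, ...).

```bash
# Hypothetical helper: print pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods -A` table output on stdin (STATUS is the 4th column).
# It does not catch pods that are Running but not fully ready (for example 0/1).
problem_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A | problem_pods
```

Each output line has the form `<namespace>/<pod-name> <status>`, which gives the values to pass to `kubectl logs` as shown above.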

**cert-manager:** A healthy ArgoCD application for cert-manager does not guarantee that TLS certificates are being issued. Also check:

```bash
kubectl get certificates -A
kubectl describe certificate <name> -n <namespace>
```

A certificate whose `READY` column shows `False` usually points to a DNS-01 challenge timeout or a Let's Encrypt rate limit. The `describe` output includes the challenge URL and the specific error message from the ACME server.

**Keycloak:** If Keycloak pods are running but the sign-in page is unreachable, check whether the HTTPRoute was created:

```bash
kubectl get httproutes -A
```

### Nebari Operator

Check the operator pod:

```bash
kubectl get pods -n nebari-operator-system
kubectl logs -n nebari-operator-system -l control-plane=controller-manager --tail=100
```

The most common failure is the operator starting before Keycloak is healthy and failing to connect. Fix: wait for the `keycloak` ArgoCD application to reach `Healthy`, then delete the operator pod to trigger a restart:

```bash
kubectl delete pod -n nebari-operator-system -l control-plane=controller-manager
```
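
The wait-then-restart sequence can be scripted. The function below is a minimal sketch: its name, retry count, and polling interval are illustrative choices, and it assumes the ArgoCD application is named `keycloak` in the `argocd` namespace and exposes its health at `.status.health.status`.

```bash
# Sketch: poll the keycloak ArgoCD application until it is Healthy, then restart the operator.
# Arguments (both optional, illustrative defaults): max attempts, seconds between attempts.
restart_operator_when_keycloak_healthy() {
  local attempts="${1:-60}" interval="${2:-10}" health i
  for i in $(seq 1 "$attempts"); do
    health="$(kubectl get application keycloak -n argocd -o jsonpath='{.status.health.status}')"
    if [ "$health" = "Healthy" ]; then
      # Same delete command as above; the Deployment recreates the pod.
      kubectl delete pod -n nebari-operator-system -l control-plane=controller-manager
      return 0
    fi
    echo "keycloak health: ${health:-unknown} (attempt $i/$attempts)" >&2
    sleep "$interval"
  done
  echo "keycloak did not become Healthy in time" >&2
  return 1
}
```

Running `restart_operator_when_keycloak_healthy 30 20` would poll up to 30 times, 20 seconds apart, and return non-zero if Keycloak never becomes healthy.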

### NebariApp status

List all NebariApps across namespaces:

```bash
kubectl get nebariapps -A
```

Inspect the status conditions on a specific NebariApp:

```bash
kubectl describe nebariapp <name> -n <namespace>
```

The four status conditions and what each one tracks:

| Condition | What it tracks |
|---|---|
| `RoutingReady` | HTTPRoute created and accepted by Envoy Gateway |
| `TLSReady` | TLS certificate issued by cert-manager |
| `AuthReady` | OIDC client provisioned in Keycloak |
| `Ready` | All three above are `True` |

Each condition includes a `Message` field explaining why it is `False` — read that before checking logs.
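
To print only the failing conditions and their messages, a JSONPath filter can replace reading the full `describe` output. This sketch assumes the NebariApp resource follows the usual Kubernetes convention of storing conditions under `.status.conditions` with `type`, `status`, and `message` fields, which this page does not confirm. The function name `failing_conditions` is hypothetical.

```bash
# Sketch: print "<type>: <message>" for every condition on a NebariApp that is not True.
# Assumes conditions live under .status.conditions (type/status/message);
# confirm the layout with `kubectl get nebariapp <name> -n <namespace> -o yaml`.
failing_conditions() {
  local name="$1" namespace="$2"
  kubectl get nebariapp "$name" -n "$namespace" \
    -o jsonpath='{range .status.conditions[?(@.status!="True")]}{.type}{": "}{.message}{"\n"}{end}'
}

# Usage (requires cluster access):
#   failing_conditions <name> <namespace>
```

No output means every condition on that NebariApp is `True`.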

## Symptom index

| Symptom | Where to look |
|---|---|
| `nic deploy` exits with an error immediately | [`nic deploy` common failures](#common-nic-deploy-failures) |
| `nic deploy` times out | [`nic deploy` common failures](#common-nic-deploy-failures) — timeout row |
| `nic deploy` hangs after "ArgoCD installed" | [`nic deploy` common failures](#common-nic-deploy-failures); [ArgoCD applications](#argocd-applications) |
| ArgoCD app stuck in `Progressing` or `Degraded` | [ArgoCD applications](#argocd-applications) |
| Pod in `CrashLoopBackOff` or `ImagePullBackOff` | [Foundational software pods](#foundational-software-pods) |
| Nebari Operator pod crashing or not connecting to Keycloak | [Nebari Operator](#nebari-operator) |
| TLS certificate not issuing | [Foundational software pods — cert-manager](#foundational-software-pods) |
| Can't sign in / Keycloak error | [Foundational software pods — Keycloak](#foundational-software-pods) |
| `NebariApp` not `Ready` | [NebariApp status](#nebariapp-status) |

## Related pages

- [NKP architecture](/docs/explanations/nkp-architecture) — understand how the layers fit together before debugging
- [Deploy a cluster](/docs/how-tos/deploy-cluster) — the deploy steps this page assumes you've run
- [Providers](/docs/how-tos/providers) — provider-specific prerequisites and configuration
1 change: 1 addition & 0 deletions docs/sidebars.js
@@ -49,6 +49,7 @@ module.exports = {
"how-tos/update-cluster",
"how-tos/destroy-cluster",
"how-tos/keycloak-auth",
"how-tos/debug-deployment",
{
type: "category",
label: "Providers",