From 988995b814ce5d8f1ccdc432da3a433fd5c0691a Mon Sep 17 00:00:00 2001
From: andrewfulton9
Date: Wed, 24 Jun 2026 10:28:51 -0600
Subject: [PATCH 1/5] docs: scaffold debug-deployment how-to page (issue #660)

---
 docs/docs/how-tos/debug-deployment.mdx | 11 +++++++++++
 docs/sidebars.js                       |  1 +
 2 files changed, 12 insertions(+)
 create mode 100644 docs/docs/how-tos/debug-deployment.mdx

diff --git a/docs/docs/how-tos/debug-deployment.mdx b/docs/docs/how-tos/debug-deployment.mdx
new file mode 100644
index 00000000..f61e341e
--- /dev/null
+++ b/docs/docs/how-tos/debug-deployment.mdx
@@ -0,0 +1,11 @@
+---
+title: Debug a deployment
+slug: /how-tos/debug-deployment
+description: How to diagnose a stuck or failing NKP deployment — covering both the nic deploy phase and post-deploy cluster health.
+---
+
+:::note
+This guide covers NKP only. For Classic Nebari, see the Classic documentation.
+:::
+
+Content coming soon.
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 6be21d5b..8b14be9f 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -49,6 +49,7 @@ module.exports = {
 "how-tos/update-cluster",
 "how-tos/destroy-cluster",
 "how-tos/keycloak-auth",
+"how-tos/debug-deployment",
 {
 type: "category",
 label: "Providers",

From b45890c441757455b1cb7d8cb743b99e90eaba00 Mon Sep 17 00:00:00 2001
From: andrewfulton9
Date: Wed, 24 Jun 2026 10:46:36 -0600
Subject: [PATCH 2/5] docs: write debug-deployment how-to page (issue #660)

---
 docs/docs/how-tos/debug-deployment.mdx | 157 ++++++++++++++++++++++++-
 1 file changed, 155 insertions(+), 2 deletions(-)

diff --git a/docs/docs/how-tos/debug-deployment.mdx b/docs/docs/how-tos/debug-deployment.mdx
index f61e341e..4af84f0e 100644
--- a/docs/docs/how-tos/debug-deployment.mdx
+++ b/docs/docs/how-tos/debug-deployment.mdx
@@ -4,8 +4,161 @@ slug: /how-tos/debug-deployment
 description: How to diagnose a stuck or failing NKP deployment — covering both the nic deploy phase and post-deploy cluster health.
 ---
 
+This guide covers debugging a stuck or failing NKP deployment. It is scoped to NKP only — Classic Nebari troubleshooting lives under `/classic/`.
+
 :::note
-This guide covers NKP only. For Classic Nebari, see the Classic documentation.
+The steps below that use `kubectl` require the CLI installed and `KUBECONFIG` set. See [Deploy a cluster](/docs/how-tos/deploy-cluster#retrieve-the-kubeconfig) for how to retrieve the kubeconfig.
 :::
 
-Content coming soon.
+## `nic deploy` phase
+
+### Reading `nic` output
+
+`nic deploy` streams structured progress to stdout. When a deploy fails, the error appears at the end of that output. Capture it for sharing or offline review:
+
+```bash
+nic deploy -f config.yaml 2>&1 | tee deploy.log
+```
+
+For deeper traces — showing exactly which internal step hung — enable the built-in OpenTelemetry console exporter:
+
+```bash
+OTEL_EXPORTER=console nic deploy -f config.yaml
+```
+
+This emits span data for every internal step, including provider API calls and ArgoCD bootstrap.
+
+### Common `nic deploy` failures
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| Exits immediately with a config or credential error | Pre-flight validation failed | Run `nic validate -f config.yaml` and fix the reported fields before re-deploying |
+| `context deadline exceeded` | Large cluster or slow network | Re-run with `--timeout 1h` |
+| Cloud API error (quota exceeded, insufficient permissions) | Provider-specific | Check your cloud provider's console; fix IAM permissions or request a quota increase, then re-run (`nic deploy` is idempotent) |
+| Hangs after "ArgoCD installed" | ArgoCD can't reach the GitOps repository | Verify `GIT_TOKEN` has read/write access to the repo, then check ArgoCD server logs (see [ArgoCD applications](#argocd-applications) below) |
+
+## Post-deploy phase
+
+After `nic deploy` returns, the cluster still needs ArgoCD to sync and start all foundational services. This section covers debugging that process in dependency order.
+
+### ArgoCD applications
+
+Start here — ArgoCD is the engine that delivers everything else.
+
+List all applications and their sync and health status:
+
+```bash
+kubectl get applications -n argocd
+```
+
+Inspect a stuck or degraded application:
+
+```bash
+kubectl describe application <app-name> -n argocd
+```
+
+What to look for:
+
+- **`SyncStatus`**: `Synced` means ArgoCD has applied the latest manifests; `OutOfSync` means it hasn't.
+- **`Health`**: `Healthy` (good), `Progressing` (still starting up), `Degraded` (something is wrong).
+- A `Degraded` application includes a human-readable message in `Status.Conditions` — read it before looking elsewhere.
+
+Check ArgoCD server logs directly:
+
+```bash
+kubectl logs -n argocd -l app.kubernetes.io/component=server --tail=100
+```
+
+Common causes of stuck ArgoCD applications: `GIT_TOKEN` expired or missing repo access, or the cluster can't pull images from the OCI registry.
+
+### Foundational software pods
+
+Each foundational component runs in its own namespace. Check for pods that are not `Running`:
+
+| Component | Namespace |
+|---|---|
+| cert-manager | `cert-manager` |
+| Envoy Gateway | `envoy-gateway-system` |
+| Keycloak | `keycloak` |
+| OpenTelemetry Collector | `monitoring` |
+
+General pattern for any component:
+
+```bash
+kubectl get pods -n <namespace>
+kubectl logs <pod-name> -n <namespace>
+```
+
+**cert-manager:** A healthy ArgoCD application for cert-manager does not guarantee that TLS certificates are being issued. Also check:
+
+```bash
+kubectl get certificates -A
+kubectl describe certificate <certificate-name> -n <namespace>
+```
+
+A `False` ready status usually means a DNS-01 challenge timeout or a Let's Encrypt rate limit. The `describe` output includes the challenge URL and the specific error message from the ACME server.
+
+**Keycloak:** If Keycloak pods are running but the sign-in page is unreachable, check whether the HTTPRoute was created:
+
+```bash
+kubectl get httproutes -A
+```
+
+### Nebari Operator
+
+Check the operator pod:
+
+```bash
+kubectl get pods -n nebari-operator-system
+kubectl logs -n nebari-operator-system -l control-plane=controller-manager --tail=100
+```
+
+The most common failure is the operator starting before Keycloak is healthy and failing to connect. Fix: wait for the `keycloak` ArgoCD application to reach `Healthy`, then delete the operator pod to trigger a restart:
+
+```bash
+kubectl delete pod -n nebari-operator-system -l control-plane=controller-manager
+```
+
+### NebariApp status
+
+List all NebariApps across namespaces:
+
+```bash
+kubectl get nebariapps -A
+```
+
+Inspect the status conditions on a specific NebariApp:
+
+```bash
+kubectl describe nebariapp <name> -n <namespace>
+```
+
+The four status conditions and what each one tracks:
+
+| Condition | What it tracks |
+|---|---|
+| `RoutingReady` | HTTPRoute created and accepted by Envoy Gateway |
+| `TLSReady` | TLS certificate issued by cert-manager |
+| `AuthReady` | OIDC client provisioned in Keycloak |
+| `Ready` | All three above are `True` |
+
+Each condition includes a `Message` field explaining why it is `False` — read that before checking logs.
+
+## Symptom index
+
+| Symptom | Where to look |
+|---|---|
+| `nic deploy` exits with an error immediately | [`nic deploy` common failures](#common-nic-deploy-failures) |
+| `nic deploy` times out | [`nic deploy` common failures](#common-nic-deploy-failures) — timeout row |
+| `nic deploy` hangs after "ArgoCD installed" | [`nic deploy` common failures](#common-nic-deploy-failures); [ArgoCD applications](#argocd-applications) |
+| ArgoCD app stuck in `Progressing` or `Degraded` | [ArgoCD applications](#argocd-applications) |
+| Pod in `CrashLoopBackOff` or `ImagePullBackOff` | [Foundational software pods](#foundational-software-pods) |
+| TLS certificate not issuing | [Foundational software pods — cert-manager](#foundational-software-pods) |
+| Can't sign in / Keycloak error | [Foundational software pods — Keycloak](#foundational-software-pods) |
+| `NebariApp` not `Ready` | [NebariApp status](#nebariapp-status) |
+
+## Related pages
+
+- [NKP architecture](/docs/explanations/nkp-architecture) — understand how the layers fit together before debugging
+- [Deploy a cluster](/docs/how-tos/deploy-cluster) — the deploy steps this page assumes you've run
+- [Providers](/docs/how-tos/providers) — provider-specific prerequisites and configuration

From 89f6bed3c69b60495587a620e17fc114960f99c0 Mon Sep 17 00:00:00 2001
From: andrewfulton9
Date: Wed, 24 Jun 2026 15:00:13 -0600
Subject: [PATCH 3/5] docs: fix symptom index and clarify OTEL_EXPORTER (issue #660)

- Add Nebari Operator row to symptom index table
- Clarify OTEL_EXPORTER is a nic-specific variable with valid values
---
 docs/docs/how-tos/debug-deployment.mdx | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/docs/how-tos/debug-deployment.mdx b/docs/docs/how-tos/debug-deployment.mdx
index 4af84f0e..058078db 100644
--- a/docs/docs/how-tos/debug-deployment.mdx
+++ b/docs/docs/how-tos/debug-deployment.mdx
@@ -20,7 +20,7 @@ The steps below that use `kubectl` require the CLI installed and `KUBECONFIG` se
 nic deploy -f config.yaml 2>&1 | tee deploy.log
 ```
 
-For deeper traces — showing exactly which internal step hung — enable the built-in OpenTelemetry console exporter:
+For deeper traces — showing exactly which internal step hung — enable the built-in OpenTelemetry console exporter (`OTEL_EXPORTER` is a `nic`-specific variable; valid values are `none`, `console`, `otlp`, and `both`):
 
 ```bash
 OTEL_EXPORTER=console nic deploy -f config.yaml
@@ -153,6 +153,7 @@ Each condition includes a `Message` field explaining why it is `False` — read
 | `nic deploy` hangs after "ArgoCD installed" | [`nic deploy` common failures](#common-nic-deploy-failures); [ArgoCD applications](#argocd-applications) |
 | ArgoCD app stuck in `Progressing` or `Degraded` | [ArgoCD applications](#argocd-applications) |
 | Pod in `CrashLoopBackOff` or `ImagePullBackOff` | [Foundational software pods](#foundational-software-pods) |
+| Nebari Operator pod crashing or not connecting to Keycloak | [Nebari Operator](#nebari-operator) |
 | TLS certificate not issuing | [Foundational software pods — cert-manager](#foundational-software-pods) |
 | Can't sign in / Keycloak error | [Foundational software pods — Keycloak](#foundational-software-pods) |
 | `NebariApp` not `Ready` | [NebariApp status](#nebariapp-status) |

From 0463352ae7256cd9c10b0c773ad8d9065e203294 Mon Sep 17 00:00:00 2001
From: andrewfulton9
Date: Wed, 24 Jun 2026 15:33:11 -0600
Subject: [PATCH 4/5] docs: link Classic troubleshooting page from debug-deployment (issue #660)

---
 docs/docs/how-tos/debug-deployment.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/how-tos/debug-deployment.mdx b/docs/docs/how-tos/debug-deployment.mdx
index 058078db..35b0a765 100644
--- a/docs/docs/how-tos/debug-deployment.mdx
+++ b/docs/docs/how-tos/debug-deployment.mdx
@@ -4,7 +4,7 @@ slug: /how-tos/debug-deployment
 description: How to diagnose a stuck or failing NKP deployment — covering both the nic deploy phase and post-deploy cluster health.
 ---
 
-This guide covers debugging a stuck or failing NKP deployment. It is scoped to NKP only — Classic Nebari troubleshooting lives under `/classic/`.
+This guide covers debugging a stuck or failing NKP deployment. It is scoped to NKP only — for Classic Nebari, see [Troubleshooting](/classic/troubleshooting).
 
 :::note
 The steps below that use `kubectl` require the CLI installed and `KUBECONFIG` set. See [Deploy a cluster](/docs/how-tos/deploy-cluster#retrieve-the-kubeconfig) for how to retrieve the kubeconfig.

From a57f21fe8cd4ac0d84583a24b0eee5802cf64f13 Mon Sep 17 00:00:00 2001
From: andrewfulton9
Date: Wed, 24 Jun 2026 15:36:35 -0600
Subject: [PATCH 5/5] docs: add ArgoCD Unknown sync status to debug-deployment (issue #660)

---
 docs/docs/how-tos/debug-deployment.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/how-tos/debug-deployment.mdx b/docs/docs/how-tos/debug-deployment.mdx
index 35b0a765..6cb5895d 100644
--- a/docs/docs/how-tos/debug-deployment.mdx
+++ b/docs/docs/how-tos/debug-deployment.mdx
@@ -59,7 +59,7 @@ kubectl describe application <app-name> -n argocd
 
 What to look for:
 
-- **`SyncStatus`**: `Synced` means ArgoCD has applied the latest manifests; `OutOfSync` means it hasn't.
+- **`SyncStatus`**: `Synced` (good), `OutOfSync` (hasn't applied the latest manifests), `Unknown` (can't reach the source repo — check the ArgoCD server logs).
 - **`Health`**: `Healthy` (good), `Progressing` (still starting up), `Degraded` (something is wrong).
 - A `Degraded` application includes a human-readable message in `Status.Conditions` — read it before looking elsewhere.
 
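
The `SyncStatus` and `Health` checks in the hunk above can be scripted so that only applications needing attention are listed. This is a minimal sketch, not part of the patch series. It assumes the ArgoCD `Application` resources expose `.status.sync.status` and `.status.health.status`, which this page does not confirm. The function names `filter_unhealthy_apps` and `list_unhealthy_apps` are hypothetical helpers, not `nic` or ArgoCD commands.

```bash
#!/usr/bin/env bash
# Sketch: list ArgoCD applications that are not both Synced and Healthy.
# Assumption: Application objects carry .status.sync.status and
# .status.health.status; both helper names below are hypothetical.

# Reads whitespace-separated "NAME SYNC HEALTH" rows on stdin and prints
# only the rows where SYNC is not "Synced" or HEALTH is not "Healthy".
filter_unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1, $2, $3 }'
}

# Queries the cluster and pipes the rows through the filter.
# Requires kubectl on PATH and KUBECONFIG pointing at the cluster.
list_unhealthy_apps() {
  kubectl get applications -n argocd --no-headers \
    -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
    | filter_unhealthy_apps
}
```

With `KUBECONFIG` set, sourcing the file and running `list_unhealthy_apps` should print one `NAME SYNC HEALTH` row per application that is `OutOfSync`, `Unknown`, `Progressing`, or `Degraded`; empty output would mean every application is `Synced` and `Healthy`. The filter is kept separate from the `kubectl` call so it can be tried on canned input without a cluster.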