2 changes: 1 addition & 1 deletion .agents/skills/debug-openshell-cluster/SKILL.md
@@ -312,7 +312,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w
| `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component |
| Stale NotReady nodes from previous deploys | Volume reused across container recreations | The deploy flow now auto-cleans stale nodes; if it still fails, manually delete NotReady nodes (see Step 2) or choose "Recreate" when prompted |
| gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered `openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` |
| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` stage. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker |
| `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy <name> && openshell gateway start` |

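The two supervisor rows above tie one log marker to one remediation. As a minimal sketch of that lookup (the function name `diagnose_health_log` is hypothetical and not part of OpenShell; only the marker string and the remediation commands come from the table), a log scan could look like this:

```python
from typing import Optional

# Marker string taken from the troubleshooting table above.
MISSING_SUPERVISOR_MARKER = "HEALTHCHECK_MISSING_SUPERVISOR"

# Remediation text copied from the table; <name> is left as a placeholder.
REMEDIATION = (
    "Rebuild cluster image: mise run docker:build:cluster, then "
    "openshell gateway destroy <name> && openshell gateway start"
)


def diagnose_health_log(health_log: str) -> Optional[str]:
    """Return the remediation hint if the marker appears in the log, else None.

    Hypothetical helper for illustration; the real detection lives in the
    bootstrap code and the health check script.
    """
    if MISSING_SUPERVISOR_MARKER in health_log:
        return REMEDIATION
    return None


if __name__ == "__main__":
    sample = "healthcheck: HEALTHCHECK_MISSING_SUPERVISOR\n"
    print(diagnose_health_log(sample))
```

A log containing only `metrics-server` noise returns `None`, which matches the table's advice to treat those errors as benign.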
## Full Diagnostic Dump
2 changes: 1 addition & 1 deletion .github/workflows/release-auto-tag.yml
@@ -6,7 +6,7 @@ name: Release Auto-Tag
on:
workflow_dispatch: {}
schedule:
- cron: "0 2 * * *" # 7 PM PDT
- cron: "0 14 * * *" # 7 AM PDT

permissions:
contents: write
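GitHub Actions evaluates `schedule` cron expressions in UTC, so the PDT comments in this hunk depend on the UTC-7 offset. The sketch below checks both comments. It assumes a fixed UTC-7 offset (valid only while daylight saving time is in effect), and the helper name `cron_utc_to_pdt` is illustrative only:

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7. During standard time (PST, UTC-8) the job runs one hour
# earlier by local wall-clock time.
PDT = timezone(timedelta(hours=-7), name="PDT")


def cron_utc_to_pdt(cron_expr: str) -> str:
    """Convert a daily 'M H * * *' cron expression (UTC) to PDT wall-clock time."""
    minute, hour, *_ = cron_expr.split()
    # The calendar date is arbitrary; only the time of day matters here.
    utc_time = datetime(2024, 7, 1, int(hour), int(minute), tzinfo=timezone.utc)
    return utc_time.astimezone(PDT).strftime("%I:%M %p %Z")


if __name__ == "__main__":
    print(cron_utc_to_pdt("0 2 * * *"))   # old schedule
    print(cron_utc_to_pdt("0 14 * * *"))  # new schedule
```

Both comments in the hunk are consistent with this conversion: `0 2 * * *` is 7 PM PDT on the previous calendar day, and `0 14 * * *` is 7 AM PDT.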
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -99,7 +99,7 @@ These pipelines connect skills into end-to-end workflows. Individual skill files

## Cluster Infrastructure Changes

- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `Dockerfile.cluster`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.
- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `deploy/docker/Dockerfile.images`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes.

## Documentation

42 changes: 21 additions & 21 deletions CONTRIBUTING.md
@@ -63,24 +63,24 @@ This project ships with [agent skills](#agent-skills-for-contributors) that can

Skills live in `.agents/skills/`. Your agent's harness can discover and load them natively. Here is the full inventory:

| Category | Skill | Purpose |
|----------|-------|---------|
| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows |
| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and health issues |
| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues |
| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue |
| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) |
| Contributing | `create-github-issue` | Create well-structured GitHub issues |
| Contributing | `create-github-pr` | Create pull requests with proper conventions |
| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions |
| Reviewing | `review-security-issue` | Assess security issues for severity and remediation |
| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs |
| Triage | `triage-issue` | Assess, classify, and route community-filed issues |
| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs |
| Platform | `tui-development` | Development guide for the ratatui-based terminal UI |
| Documentation | `update-docs` | Scan recent commits and draft doc updates for user-facing changes |
| Maintenance | `sync-agent-infra` | Detect and fix drift across agent-first infrastructure files |
| Reference | `sbom` | Generate SBOMs and resolve dependency licenses |
| Category | Skill | Purpose |
| --------------- | ------------------------- | --------------------------------------------------------------------------------------------------- |
| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows |
| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and health issues |
| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues |
| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue |
| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) |
| Contributing | `create-github-issue` | Create well-structured GitHub issues |
| Contributing | `create-github-pr` | Create pull requests with proper conventions |
| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions |
| Reviewing | `review-security-issue` | Assess security issues for severity and remediation |
| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs |
| Triage | `triage-issue` | Assess, classify, and route community-filed issues |
| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs |
| Platform | `tui-development` | Development guide for the ratatui-based terminal UI |
| Documentation | `update-docs` | Scan recent commits and draft doc updates for user-facing changes |
| Maintenance | `sync-agent-infra` | Detect and fix drift across agent-first infrastructure files |
| Reference | `sbom` | Generate SBOMs and resolve dependency licenses |

### Workflow Chains

@@ -148,10 +148,10 @@ openshell sandbox create -- codex

Two additional scripts in `scripts/bin/` provide gateway-aware wrappers for cluster debugging:

| Script | What it does |
|--------|-------------|
| Script | What it does |
| --------- | ------------------------------------------------------------------------------------ |
| `kubectl` | Runs `kubectl` inside the active gateway's k3s container via `openshell doctor exec` |
| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` |
| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` |

These work for both local and remote gateways (SSH is handled automatically). Examples:

18 changes: 9 additions & 9 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 1 addition & 3 deletions Cargo.toml
@@ -6,7 +6,7 @@ resolver = "2"
members = ["crates/*"]

[workspace.package]
version = "0.1.0"
version = "0.0.0"
edition = "2024"
rust-version = "1.88"
license = "Apache-2.0"
@@ -124,8 +124,6 @@ ref_option = "allow" # Common pattern for optional references
missing_fields_in_debug = "allow" # Manual Debug impls often intentionally omit fields

[profile.release]
lto = "thin"
codegen-units = 1
strip = true

[profile.dev]
21 changes: 17 additions & 4 deletions architecture/build-containers.md
@@ -6,7 +6,7 @@ OpenShell produces two container images, both published for `linux/amd64` and `l

The gateway runs the control plane API server. It is deployed as a StatefulSet inside the cluster container via a bundled Helm chart.

- **Dockerfile**: `deploy/docker/Dockerfile.gateway`
- **Docker target**: `gateway` in `deploy/docker/Dockerfile.images`
- **Registry**: `ghcr.io/nvidia/openshell/gateway:latest`
- **Pulled when**: Cluster startup (the Helm chart triggers the pull)
- **Entrypoint**: `openshell-server --port 8080` (gRPC + HTTP, mTLS)
@@ -15,11 +15,11 @@ The gateway runs the control plane API server. It is deployed as a StatefulSet i

The cluster image is a single-container Kubernetes distribution that bundles the Helm charts, Kubernetes manifests, and the `openshell-sandbox` supervisor binary needed to bootstrap the control plane.

- **Dockerfile**: `deploy/docker/Dockerfile.cluster`
- **Docker target**: `cluster` in `deploy/docker/Dockerfile.images`
- **Registry**: `ghcr.io/nvidia/openshell/cluster:latest`
- **Pulled when**: `openshell gateway start`

The supervisor binary (`openshell-sandbox`) is cross-compiled in a build stage and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.
The supervisor binary (`openshell-sandbox`) is built by the shared `supervisor-builder` stage in `deploy/docker/Dockerfile.images` and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images.

## Sandbox Images

@@ -42,7 +42,7 @@ The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes
| Changed files | Rebuild triggered |
|---|---|
| Cargo manifests, proto definitions, cross-build script | Gateway + supervisor |
| `crates/openshell-server/*`, `Dockerfile.gateway` | Gateway |
| `crates/openshell-server/*`, `deploy/docker/Dockerfile.images` | Gateway |
| `crates/openshell-sandbox/*`, `crates/openshell-policy/*` | Supervisor |
| `deploy/helm/openshell/*` | Helm upgrade |

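The routing table above can be read as a set of glob rules whose matches are unioned across all changed files. The sketch below illustrates that idea only; the real logic lives in `cluster-deploy-fast.sh`, which is not shown in this diff. The globs for "Cargo manifests" and "proto definitions" (`*Cargo.toml`, `Cargo.lock`, `proto/*`) are assumptions, the cross-build script is omitted because its path is not given, and the target label `gateway` is assumed by analogy with the `supervisor` and `chart` targets:

```python
from fnmatch import fnmatchcase

# (glob, targets) pairs modeled on the table above. Entries marked "assumed"
# are guesses at paths the table names only in prose.
ROUTES = [
    ("*Cargo.toml", {"gateway", "supervisor"}),  # assumed glob for Cargo manifests
    ("Cargo.lock", {"gateway", "supervisor"}),   # assumed to count as a manifest
    ("proto/*", {"gateway", "supervisor"}),      # assumed proto directory
    ("crates/openshell-server/*", {"gateway"}),
    ("deploy/docker/Dockerfile.images", {"gateway"}),
    ("crates/openshell-sandbox/*", {"supervisor"}),
    ("crates/openshell-policy/*", {"supervisor"}),
    ("deploy/helm/openshell/*", {"chart"}),
]


def rebuild_targets(changed_files):
    """Return the union of rebuild targets triggered by the changed paths."""
    targets = set()
    for path in changed_files:
        for pattern, route in ROUTES:
            # fnmatchcase's '*' also matches '/', so nested paths are covered.
            if fnmatchcase(path, pattern):
                targets |= route
    return targets


if __name__ == "__main__":
    changed = ["crates/openshell-server/src/lib.rs", "deploy/helm/openshell/values.yaml"]
    print(sorted(rebuild_targets(changed)))
```

Unrelated files such as `README.md` match no rule and produce an empty set, which corresponds to the "unrelated" scenario the test harness in this file covers.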
@@ -58,3 +58,16 @@ mise run cluster -- supervisor # rebuild supervisor only
mise run cluster -- chart # helm upgrade only
mise run cluster -- all # rebuild everything
```

To validate incremental routing and BuildKit cache reuse locally, run:

```bash
mise run cluster:test:fast-deploy-cache
```

The harness runs isolated scenarios in temporary git worktrees, keeps its own state and cache under `.cache/cluster-deploy-fast-test/`, and writes a Markdown summary with:

- auto-detection checks for gateway-only, supervisor-only, shared, Helm-only, unrelated, and explicit-target changes
- cold vs warm rebuild comparisons for gateway and supervisor code changes
- container-ID invalidation coverage to verify gateway + Helm are retriggered when the cluster container changes

8 changes: 4 additions & 4 deletions architecture/gateway-single-node.md
@@ -29,7 +29,7 @@ Out of scope:
- `crates/openshell-bootstrap/src/push.rs`: Local development image push into k3s containerd.
- `crates/openshell-bootstrap/src/paths.rs`: XDG path resolution.
- `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, container/volume/network naming).
- `deploy/docker/Dockerfile.cluster`: Container image definition (k3s base + Helm charts + manifests + entrypoint).
- `deploy/docker/Dockerfile.images` (target `cluster`): Container image definition (k3s base + Helm charts + manifests + entrypoint).
- `deploy/docker/cluster-entrypoint.sh`: Container entrypoint (DNS proxy, registry config, manifest injection).
- `deploy/docker/cluster-healthcheck.sh`: Docker HEALTHCHECK script.
- Docker daemon(s):
@@ -228,7 +228,7 @@ After deploy, the CLI calls `save_active_gateway(name)`, writing the gateway nam

## Container Image

The gateway image is defined in `deploy/docker/Dockerfile.cluster`:
The cluster image is defined by target `cluster` in `deploy/docker/Dockerfile.images`:

```
Base: rancher/k3s:v1.35.2-k3s1
@@ -298,7 +298,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa

- `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`.
- When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`.
- `deploy/docker/Dockerfile.cluster` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image.
- `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory.
- `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery.
- k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically.
@@ -454,7 +454,7 @@ openshell/
- `crates/openshell-cli/src/main.rs` -- CLI command definitions
- `crates/openshell-cli/src/run.rs` -- CLI command implementations
- `crates/openshell-cli/src/bootstrap.rs` -- auto-bootstrap from sandbox create
- `deploy/docker/Dockerfile.cluster` -- container image definition
- `deploy/docker/Dockerfile.images` -- shared image build definition (cluster target)
- `deploy/docker/cluster-entrypoint.sh` -- container entrypoint script
- `deploy/docker/cluster-healthcheck.sh` -- Docker HEALTHCHECK script
- `deploy/kube/manifests/openshell-helmchart.yaml` -- OpenShell Helm chart manifest
5 changes: 4 additions & 1 deletion crates/openshell-core/build.rs
@@ -17,7 +17,10 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
}

// --- Protobuf compilation ---
// Use bundled protoc from protobuf-src
// Use bundled protoc from protobuf-src. The system protoc (from apt-get)
// does not bundle the well-known type includes (google/protobuf/struct.proto
// etc.), so we must use protobuf-src which ships both the binary and the
// include tree.
// SAFETY: This is run at build time in a single-threaded build script context.
// No other threads are reading environment variables concurrently.
#[allow(unsafe_code)]