Skip to content

Implement sandlock-oci: OCI-compliant runtime for namespace-less sandboxing#31

Open
sachin2605 wants to merge 1 commit intomultikernel:mainfrom
bytehooks:main
Open

Implement sandlock-oci: OCI-compliant runtime for namespace-less sandboxing#31
sachin2605 wants to merge 1 commit intomultikernel:mainfrom
bytehooks:main

Conversation

@sachin2605
Copy link
Copy Markdown

Pull Request: Implement OCI-compliant Runtime for Sandlock Sandbox

Title

Implement sandlock-oci: OCI-compliant runtime for namespace-less sandboxing

Description

This PR introduces sandlock-oci, a low-level OCI runtime shim that enables the sandlock sandbox to be used as a drop-in replacement for runc in containerd, CRI-O, and Kubernetes environments.

Unlike traditional runtimes, sandlock-oci focuses on a "no-container" approach, providing isolation via Landlock and Seccomp-Notify without relying on Linux namespaces (PID, Mount, Network, etc.). This makes it ideal for high-performance, lightweight sandboxing of host-level processes.

Key Features

  • OCI Spec Translation: Complete mapping of OCI config.json (filesystem rules, resource limits, environment) to sandlock::Policy.
  • Lifecycle Management: Full implementation of OCI commands: create, start, state, kill, delete, and list.
  • Supervisor-Mediated Sandboxing: A background supervisor process manages child lifecycle and signal synchronization (SIGSTOP/SIGCONT).
  • Persistent State: Reliable state tracking in /run/sandlock-oci for integration with container engines.
  • Kubernetes Readiness: Support for RuntimeClass integration, enabling runtimeClassName: sandlock in Pod specs.

Implementation Details

  • crates/sandlock-oci: The new OCI shim crate.
  • src/spec.rs: OCI specification parsing and policy mapping logic.
  • src/supervisor.rs: Background daemon logic for process orchestration.
  • src/state.rs: JSON-based state persistence for OCI compliance.

Testing

  • Integration Tests: Comprehensive Rust test suite in tests/integration.rs covering the full OCI lifecycle.
  • Containerd E2E: Integration script (tests/containerd/test_containerd.sh) validating manual OCI lifecycle and containerd registration.
  • Kubernetes E2E: kind cluster test script (tests/kubernetes/test_kind.sh) for validating RuntimeClass and Pod deployments.

Checklist

  • Integration with sandlock-core for Landlock/Seccomp confinement
  • State persistence in /run/sandlock-oci
  • End-to-end validation with containerd and kind cluster

@congwang-mk
Copy link
Copy Markdown
Contributor

Code review

Overview

Adds a new crates/sandlock-oci workspace member implementing OCI runtime commands (create/start/state/kill/delete/list/check) with a double-fork supervisor that SIGSTOPs the child until start is called. The architectural shape is reasonable for a namespace-less OCI shim. CI is green, but a couple of substantive concerns aren't covered by the existing CI matrix — details below.

Blocking issues

1. The translated OCI policy is discarded — sandbox does nothing

crates/sandlock-oci/src/main.rs:484 builds a PolicyBuilder from the spec and immediately drops it:

let _builder = spec::spec_to_policy(&spec, &bundle)?;

Then at main.rs:536:

let policy = sandlock_core::Policy::builder()
    .build()
    .unwrap_or_else(|_| {
        sandlock_core::Policy::builder().build().expect("minimal policy")
    });

An empty policy is built and handed to the supervisor. As a result, none of the spec-derived rules are applied at runtime:

  • rootfs chroot, fs_read/fs_write/fs_mount
  • process.env, process.cwd
  • max_memory / max_processes / max_cpu
  • bind mounts

The unwrap_or_else fallback is the same call as the primary; it can't recover anything.

In addition, confine_current_process (crates/sandlock-core/src/lib.rs:58) only applies Landlock fs/IPC rules — it does not chdir, chroot, or setenv. Even if the builder were threaded through, the child still wouldn't enter the rootfs, get the spec env, or land at the spec cwd. execvp(cmd[0], …) resolves against the host $PATH, not the bundle's rootfs, so a containerd test that runs busybox actually runs the host's binary. Net effect: pods scheduled with runtimeClassName: sandlock are effectively unconfined, which is more dangerous than runc because users may assume sandlock-grade isolation.

CI doesn't catch this because no test creates a container, runs a process, and asserts that the rootfs/env/limits took effect. The shell scripts that would catch it aren't part of CI.

2. Integration tests duplicate production code rather than test it

tests/integration.rs:1577-1637 defines a private sandlock_oci_test_helpers module that re-implements stripped-down copies of load_spec, spec_to_policy, ContainerState, Status, and the setters; the assertions then run against those copies:

let spec = sandlock_oci_test_helpers::load_spec(dir.path())let builder = sandlock_oci_test_helpers::spec_to_policy(&spec, dir.path()).unwrap();

So spec_load_and_policy_mapping and state_created_lifecycle are green but they're validating the helper, not crate::spec/crate::state. The CLI-binary tests (oci_check_exits_zero, oci_state_unknown_container_errors, oci_list_no_containers, oci_kill_unknown_container_errors, oci_delete_nonexistent_is_ok) exercise the real binary, but only commands that don't require root and don't go through the supervisor path — i.e., none of the actual create→start→kill lifecycle.

Two ways forward: (a) add [lib] to crates/sandlock-oci/Cargo.toml re-exporting pub mod spec; pub mod state; and import from it directly in tests, or (b) move these checks into #[cfg(test)] mod tests blocks in the same src/*.rs files.

Significant issues

3. CI silently skips the bin's #[cfg(test)] tests, and they're broken

crates/sandlock-oci/src/state.rs:215 and :224 call state.set_running(12345) / set_running(1), but set_running at state.rs:122 takes no arguments. cargo test -p sandlock-oci --bins fails with two E0061 errors. CI misses it because .github/workflows/ci.yml:30 uses cargo test --release --lib, and sandlock-oci is a [[bin]] without a [lib] target. Worth fixing both the call sites and adding --bins (or a lib target) to CI so this can't drift again.

4. Race on pid_file

main.rs:548-559: parent waits for the intermediate fork, sleeps 100 ms, then reads state.pid and writes pid-file. The supervisor only updates state.pid after binding the socket and forking the grandchild. On a busy host this can exceed 100 ms, and the pid-file gets 0. containerd / CRI-O treat that as authoritative for tracking; a 0 will cause them to lose the container. Use a real synchronization primitive — e.g. a pipe inherited across the fork that the supervisor writes the PID to, or have the supervisor write the pid-file itself before going into accept().

5. run_supervisor accept loop is inconsistent

supervisor.rs:1319-1361 calls listener.set_nonblocking(false) (blocking), but a match arm handles WouldBlock with a sleep. The WouldBlock branch is dead code as written.

6. send_command framing is fragile

supervisor.rs:1239-1247: read loop breaks on \n, but serde_json::to_writer (lines 1336, 1351) writes no trailing newline. The client only exits via EOF, which only happens because each accept arm drops the stream. Works today, but small refactors (e.g. holding the stream open) would deadlock the client. Use writeln! on the supervisor side or a length-prefixed frame.

7. process.args empty fallback is unsafe

main.rs:491: falls back to vec!["sh".to_string()]. OCI requires args non-empty for a startable container; failing fast with a clear error is better than silently running sh from the host.

Smaller notes

  • spec.rs:750-754: hard-codes /usr, /lib, /lib64, /bin, /sbin, /etc, /proc, /dev as fs_read and /tmp as fs_write, ignoring root.readonly and the mounts list. Push these decisions through map_mounts instead.
  • spec.rs:820-823: silently skipping proc/sysfs/devpts/tmpfs/mqueue/cgroup{,2} mounts is the right call for a namespace-less runtime, but no diagnostic. A --debug-gated eprintln! would help triage.
  • state.rs:1064-1067: set_created(pid) keeps status = Created and only updates the pid; the name suggests a status transition. set_init_pid is clearer.
  • state.rs: STATE_DIR = "/run/sandlock-oci" is hard-coded. An env override (e.g. SANDLOCK_OCI_STATE_DIR) would unlock real lifecycle integration tests without root.
  • supervisor.rs: the supervisor doesn't chdir("/") after the second fork, so it pins the caller's cwd.
  • state.rs: the Stopped transition doesn't capture exit code or terminating signal — worth adding for CRI consumers.
  • Build warnings to clean up: unused bail import in spec.rs:6, unused FsIsolation import in spec.rs:8, redundant UnixListener import in supervisor.rs:9 (re-imported inside run_supervisor), unused state_json function in state.rs:142.
  • Single commit message "https://github.com/bytehooks/sandldd sandbox-oci crate and test cases…" looks truncated/garbled and references an unrelated repo. Squash to a clean message before merge.
  • PR head and base are both main — feature work on the contributor's main complicates rebases and reviews.
  • tests/containerd/test_containerd.sh mutates /etc/containerd/config.toml and restarts containerd. Cleanup is EXIT-trapped and sed-based; ^C between cp backup and restart leaves the host containerd misconfigured.
  • tests/kubernetes/test_kind.sh pins kindest/node:v1.30.0 — worth an env override.

Recommendation

Issues 1 and 2 are blocking: the runtime doesn't currently apply any spec-derived isolation, and the test suite that's keeping CI green doesn't exercise the modules in question. Once spec→policy is plumbed through to a real chroot/exec path inside the rootfs, and at least one end-to-end test creates a container and asserts on its confined view, the rest is small cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants