
Docker hardened images#104

Merged
andhreljaKern merged 15 commits into dev from hardened-images
Jun 11, 2026

Conversation

lumburovskalina commented May 26, 2026

Contributor

Main PR

Related PRs:
Parent-images:

Application repos:

CI/CD:

JWittmeyer commented May 27, 2026

Member

I got an Unauthorized response, though I am not sure whether it is caused by this change:
GET http://localhost:4455/refinery/images/refinery-favicon.ico 401 (Unauthorized)
And:
530-7847636463c2b950.js:6 POST http://localhost:4455/.ory/kratos/public//self-service/login?flow=83295f17-1d36-41cf-b213-88849c9a566c 422 (Unprocessable Entity)

I can still log in, so I am not sure what causes this.

  • resolved
  • not part of the change

JWittmeyer commented May 27, 2026

Member

Noticed (not part of this PR, but a small change, so probably easy to do):
The project modal in cognition still has the tab "PDF Upload". Since we can now use different file types, we should rename it to "File Upload".

  • resolved
  • not relevant (new backlog item created)

JWittmeyer commented May 27, 2026

Member

I get errors when I try to bash start the cognition-ui:

> next dev

 ⚠ Invalid next.config.js options detected:
 ⚠     Unrecognized key(s) in object: 'swcMinify'
 ⚠ See more info here: https://nextjs.org/docs/messages/invalid-next-config
   ▲ Next.js 15.5.18
   - Local:        http://localhost:3000
   - Network:      http://172.18.0.23:3000
   - Experiments (use with caution):
     ✓ scrollRestoration

 ✓ Starting...
[Error: EACCES: permission denied, unlink '/app/.next/build-manifest.json'] {
  errno: -13,
  code: 'EACCES',
  syscall: 'unlink',
  path: '/app/.next/build-manifest.json'
}
node:events:495
      throw er; // Unhandled 'error' event
      ^

Error: EACCES: permission denied, open '/app/.next/trace'
Emitted 'error' event on WriteStream instance at:
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -13,
  code: 'EACCES',
  syscall: 'open',
  path: '/app/.next/trace'
}

Node.js v18.20.8
npm notice
npm notice New major version of npm available! 10.8.2 -> 11.15.0
npm notice Changelog: https://github.com/npm/cli/releases/tag/v11.15.0
npm notice To update run: npm install -g npm@11.15.0
npm notice

stopping container...            [done]


  • resolved
  • user problem
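
For reproducing the EACCES errors in the log above, a quick check of whether the current container user may write to the build directory can help. This is a minimal sketch, not code from the repo: `describe_write_access` is a hypothetical helper, and `/app/.next` is taken from the log.

```python
import os


def describe_write_access(path: str) -> str:
    """Report whether the current process user can create and delete entries in `path`."""
    uid = os.getuid()
    if not os.path.isdir(path):
        return f"{path}: missing (uid {uid})"
    # Creating or unlinking files needs write and execute permission on the directory.
    if os.access(path, os.W_OK | os.X_OK):
        return f"{path}: writable (uid {uid})"
    owner = os.stat(path).st_uid
    return f"{path}: not writable (uid {uid}, owned by uid {owner})"
```

Calling `describe_write_access("/app/.next")` inside the container shows the effective UID next to the directory owner, which is usually enough to tell a root-owned bind mount from an image problem.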

JWittmeyer left a comment

Member

Automated review (review-bot)

Comment thread Dockerfile Outdated
Comment thread Dockerfile Outdated
Comment thread Dockerfile
Comment thread dev.Dockerfile Outdated
Comment thread dev.Dockerfile
Comment thread Dockerfile Outdated
Comment thread .dockerignore
Comment thread .drone.yml Outdated

JWittmeyer commented May 27, 2026

Member

🤖 Review Bot Summary

Risk Level: HIGH

| Severity | Count |
| --- | --- |
| 🔴 Critical | 35 |
| 🟡 Suggestion | 76 |
| 🔵 Note | 101 |
| ❓ Question | 27 |
| Total | 239 |

Per-Repo Breakdown

| Repo | 🔴 | 🟡 | 🔵 | ❓ | Total | PR |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ admin-dashboard | 0 | 3 | 3 | 0 | 6 | #113 |
| ⚠️ cognition-etl-provider | 3 | 3 | 1 | 1 | 8 | #69 |
| ⚠️ cognition-exec-env | 1 | 2 | 2 | 1 | 6 | #85 |
| ⚠️ cognition-gateway | 1 | 3 | 1 | 2 | 7 | #352 |
| ⚠️ cognition-graphrag | 2 | 4 | 3 | 1 | 10 | #39 |
| ⚠️ cognition-integration-provider | 3 | 2 | 4 | 2 | 11 | #107 |
| ⚠️ cognition-pdf2md | 1 | 2 | 4 | 1 | 8 | #10 |
| ✅ cognition-task-master | 0 | 4 | 2 | 1 | 7 | #59 |
| ⚠️ cognition-ui | 1 | 2 | 6 | 2 | 11 | #222 |
| ⚠️ refinery-ac-exec-env | 1 | 3 | 3 | 1 | 8 | #102 |
| ✅ refinery-authorizer | 0 | 2 | 7 | 0 | 9 | #78 |
| ⚠️ refinery-common-parent-image | 2 | 3 | 4 | 0 | 9 | #13 |
| ⚠️ refinery-embedder | 1 | 4 | 7 | 0 | 12 | #195 |
| ⚠️ refinery-entry | 1 | 3 | 3 | 2 | 9 | #44 |
| ⚠️ refinery-exec-env-parent-image | 1 | 2 | 3 | 0 | 6 | #9 |
| ⚠️ refinery-gateway | 1 | 1 | 7 | 0 | 9 | #413 |
| ✅ refinery-lf-exec-env | 0 | 0 | 8 | 0 | 8 | #104 |
| ⚠️ refinery-mini-parent-image | 1 | 3 | 3 | 0 | 7 | #7 |
| ✅ refinery-ml-exec-env | 0 | 2 | 5 | 1 | 8 | #112 |
| ⚠️ refinery-model-provider | 2 | 4 | 1 | 2 | 9 | #89 |
| ⚠️ refinery-neural-search | 2 | 2 | 3 | 1 | 8 | #121 |
| ⚠️ refinery-next-parent-image | 1 | 2 | 1 | 2 | 6 | #14 |
| ⚠️ refinery-tokenizer | 2 | 4 | 4 | 1 | 11 | #118 |
| ✅ refinery-torch-cpu-parent-image | 0 | 3 | 3 | 1 | 7 | #8 |
| ⚠️ refinery-torch-cuda-parent-image | 2 | 0 | 4 | 0 | 6 | #8 |
| ✅ refinery-ui | 0 | 4 | 2 | 2 | 8 | #104 |
| ✅ refinery-weak-supervisor | 0 | 4 | 3 | 1 | 8 | #94 |

📋 General Findings

These findings are not tied to a specific file and are listed here instead of as inline comments.

  • QUESTION (cognition-exec-env): Is there a coordinated change in cognition-gateway (or compose/k8s manifests) that switches exec-env container startup from user="code_runner" to UID 65532? Without it, code execution containers will not start after this image is deployed.
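
If the gateway side does need the matching change, it could look roughly like the sketch below. This is an assumption-heavy illustration: `exec_env_run_kwargs` and `EXEC_ENV_UID_GID` are hypothetical names, the real `run_container()` is not shown in this thread, and the only external API relied on is the `user` and `detach` keyword arguments of the Docker SDK for Python's `containers.run`.

```python
from __future__ import annotations

# Assumption: matches the numeric USER 65532:65532 used by the hardened images.
EXEC_ENV_UID_GID = "65532:65532"


def exec_env_run_kwargs(image: str, command: list[str]) -> dict:
    """Build keyword arguments intended for docker SDK `client.containers.run(**kwargs)`."""
    return {
        "image": image,
        "command": command,
        # A numeric uid:gid does not need a matching /etc/passwd entry in the image,
        # unlike a user name such as "code_runner".
        "user": EXEC_ENV_UID_GID,
        "detach": True,
    }
```

With the Docker SDK installed, the result would be passed along the lines of `docker.from_env().containers.run(**exec_env_run_kwargs(image, cmd))`. Keeping the UID in one constant makes it easier to keep the gateway and the image in step.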

🔗 Cross-Repo Findings

  • 🔴 [code-kern-ai/cognition-exec-env, code-kern-ai/cognition-gateway] cognition-exec-env removes the code_runner user (RUN useradd ... code_runner deleted, runtime is now UID 65532), but cognition-gateway PR #352 does not update run_container(), which still starts exec-env containers with user="code_runner". Docker will reject container start with an unknown user, breaking all pipeline Python code execution.
  • 🔴 [code-kern-ai/refinery-torch-cuda-parent-image, code-kern-ai/refinery-embedder] refinery-torch-cuda-parent-image replaces nvidia/cuda:13.0.2-base-ubuntu22.04 with dhi.io/python:3.11.11-debian12 in both builder and runtime stages. No CUDA toolkit or NVIDIA driver runtime remains in the published v2.6.0-torch-cuda parent tag. refinery-embedder gpu.Dockerfile still inherits that tag and installs torch-cuda wheels; GPU inference will fail at runtime without libcuda/NVIDIA stack.
  • 🔴 [code-kern-ai/refinery-tokenizer, code-kern-ai/refinery-neural-search, code-kern-ai/cognition-etl-provider, code-kern-ai/cognition-integration-provider, code-kern-ai/refinery-embedder, code-kern-ai/refinery-model-provider, code-kern-ai/refinery-authorizer, code-kern-ai/refinery-weak-supervisor, code-kern-ai/cognition-gateway, code-kern-ai/cognition-task-master, code-kern-ai/cognition-graphrag, code-kern-ai/refinery-gateway] Twelve application repos add USER 65532:65532 while keeping uvicorn bound to port 80. On Linux, an unprivileged UID cannot bind ports < 1024 by default unless the process has CAP_NET_BIND_SERVICE (for example via setcap) or the net.ipv4.ip_unprivileged_port_start sysctl is lowered. These images previously ran as root (no USER directive), so this is a coordinated production break across the Python service fleet unless the container runtime or orchestration grants one of these.
  • 🔴 [code-kern-ai/cognition-etl-provider, code-kern-ai/refinery-exec-env-parent-image] cognition-etl-provider downloads NLTK corpora in the builder stage with default NLTK_DATA (typically /root/nltk_data), but the runtime stage only copies ${VENV_PATH} and /app. NLTK data is never copied or re-downloaded, so tokenization/POS tagging will fail at runtime. refinery-exec-env-parent-image correctly sets NLTK_DATA=/opt/nltk_data and copies it; cognition-etl-provider does not follow that pattern.
  • 🔴 [code-kern-ai/refinery-ml-exec-env, code-kern-ai/refinery-ac-exec-env, code-kern-ai/refinery-lf-exec-env] refinery-ml-exec-env builder copies the repo to /app (including run.sh), but runtime keeps ENTRYPOINT ["/run.sh"]. refinery-ac-exec-env and refinery-lf-exec-env correctly changed to ENTRYPOINT ["/app/run.sh"]. ml-exec-env will fail with 'executable file not found' because /run.sh is no longer at the filesystem root.
  • 🔴 [code-kern-ai/cognition-pdf2md, code-kern-ai/refinery-next-parent-image] cognition-pdf2md production runtime switches to minimal dhi.io/node:20-debian12 but keeps CMD ["npm", "start"]. DHI runtime images typically omit npm/node build tooling present only in the -dev builder tag. The builder installs deps and copies the tree, but starting via npm requires node/npm on PATH in the runtime image.
  • 🟡 [code-kern-ai/refinery-exec-env-parent-image, code-kern-ai/refinery-next-parent-image, code-kern-ai/refinery-common-parent-image, code-kern-ai/refinery-mini-parent-image] Four parent-image repos (exec-env, next, common, mini) changed arm64 pipeline cache_from from arch-specific tags (*-arm64) to amd64 tags (*-exec-env, *-next, etc.). Pulling wrong-architecture cache layers on arm64 builds causes cache misses or invalid layer reuse, slowing CI and potentially producing inconsistent arm64 images across the parent-image fleet.
  • 🟡 [code-kern-ai/refinery-exec-env-parent-image, code-kern-ai/refinery-torch-cpu-parent-image, code-kern-ai/refinery-torch-cuda-parent-image, code-kern-ai/refinery-next-parent-image, code-kern-ai/refinery-common-parent-image, code-kern-ai/refinery-mini-parent-image, code-kern-ai/refinery-gateway, code-kern-ai/refinery-embedder, code-kern-ai/refinery-model-provider] All six parent-image repos rebuild with DHI bases, non-root UID 65532, and multi-stage layouts, but every application repo keeps the same mutable PARENT_IMAGE tag (e.g. v2.5.0-common, v2.6.0-torch-cpu) without a version bump. Publishing parent images before application images land—or republishing tags mid-rollout—silently changes UID, base OS, and filesystem layout for all consumers. Coordinate release order and consider semver bumps for breaking hardening changes.
  • 🔵 [code-kern-ai/refinery-common-parent-image, code-kern-ai/refinery-mini-parent-image, code-kern-ai/refinery-torch-cpu-parent-image, code-kern-ai/refinery-exec-env-parent-image, code-kern-ai/refinery-ml-exec-env, code-kern-ai/refinery-embedder, code-kern-ai/refinery-model-provider, code-kern-ai/cognition-etl-provider] The provided evidence does not include parent Dockerfiles showing USER 65532:65532, nor the cited downstream files (cognition-etl-provider runtime Dockerfile with apt-get, refinery-ml-exec-env/run_ml.py writing to /inference). Without those sources, the non-root inheritance risk and writable-path audit cannot be confirmed or dismissed; the concern remains plausible but unverified from the supplied snippets.
  • 🟡 [code-kern-ai/refinery-entry, code-kern-ai/cognition-ui, code-kern-ai/admin-dashboard, code-kern-ai/refinery-gateway, code-kern-ai/cognition-gateway, code-kern-ai/cognition-integration-provider, code-kern-ai/cognition-graphrag, code-kern-ai/cognition-pdf2md, code-kern-ai/refinery-tokenizer] At least nine repos add USER 65532:65532 to dev images after root-owned COPY/npm install/pip install without --chown. Bind-mounted host checkouts (VOLUME /app) combined with --reload or npm run dev will hit EACCES on writes. This breaks the shared local-dev workflow across Python and Next.js services in the group.
  • 🟡 [code-kern-ai/cognition-graphrag] cognition-graphrag production moves to /opt/venv/bin/uvicorn but dev.Dockerfile still invokes /usr/local/bin/uvicorn after adding USER 65532:65532. Dev and prod entrypoints diverge within the same PR group, making local reproduction of production behavior unreliable.
  • 🟡 [code-kern-ai/refinery-gateway, code-kern-ai/cognition-gateway, code-kern-ai/refinery-tokenizer, code-kern-ai/refinery-neural-search, code-kern-ai/refinery-embedder, code-kern-ai/refinery-authorizer, code-kern-ai/refinery-ui, code-kern-ai/cognition-ui, code-kern-ai/admin-dashboard] Security-hardening PR removes build tooling and adds non-root runtime across ~20 services, but none of the changed production Dockerfiles add HEALTHCHECK directives (cognition-exec-env retains its existing one). Orchestrators cannot detect hung/unready containers after hardening. Services already expose /healthcheck or /health routes—wire them up consistently.
  • 🔵 [code-kern-ai/refinery-exec-env-parent-image, code-kern-ai/refinery-torch-cpu-parent-image, code-kern-ai/refinery-torch-cuda-parent-image, code-kern-ai/refinery-next-parent-image, code-kern-ai/refinery-common-parent-image, code-kern-ai/refinery-mini-parent-image] All parent-image and many application pipelines now authenticate DHI base-image pulls via base_image_registry: dhi.io using dockerhub_username/dockerhub_password secrets. Reusing Docker Hub credentials for a third-party hardened-image registry expands blast radius if CI secrets leak and makes credential rotation coupled across registries.
  • 🔵 [code-kern-ai/refinery-common-parent-image, code-kern-ai/refinery-mini-parent-image, code-kern-ai/refinery-exec-env-parent-image, code-kern-ai/refinery-model-provider, code-kern-ai/cognition-exec-env] Evidence confirms arm64 pipelines consistently pin plugins/docker:21.2.8-linux-amd64 while platform.arch is arm64 (refinery-common-parent-image, refinery-next-parent-image including arm64-dockerhub, refinery-torch-cpu-parent-image). This implies amd64 plugin execution on arm64-targeted builds—typically QEMU on amd64 runners. Whether that is safe depends on runner architecture, which is not defined in the repository configs; on native arm64 runners this plugin choice may fail to execute.
  • 🔵 [code-kern-ai/refinery-entry, code-kern-ai/refinery-ui, code-kern-ai/cognition-ui] refinery-entry PR Parent image update #44 bumps submodule pointers for submodules/javascript-functions (60f9eeba) and submodules/react-components (37dbcb486d1f), but no linked PR for those submodule repos appears in this group. If submodule commits introduce shared-component API changes, refinery-ui/cognition-ui/admin-dashboard consumers may be out of sync.
  • ❓ [code-kern-ai/cognition-gateway, code-kern-ai/refinery-gateway, code-kern-ai/refinery-tokenizer, code-kern-ai/refinery-neural-search] No API contract, endpoint rename, or DB→API→frontend field changes appear in this PR group; all changes are Docker/CI hardening. Is there a separate deployment manifest PR (K8s/compose) that updates targetPort, capabilities (CAP_NET_BIND_SERVICE), exec-env user mapping, or ingress routes to match the new non-root/port-80 behavior?
  • ❓ [code-kern-ai/cognition-exec-env, code-kern-ai/cognition-gateway] Is the code_runner→65532 UID change intentional without a cognition-gateway update, or is a gateway PR planned outside this group? Without it, Python pipeline steps cannot spawn exec-env containers.
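
The port-80 finding above can be verified from inside a running container rather than argued about. The probe below is a minimal sketch; `bind_permitted` is a hypothetical helper, and whether port 80 is actually allowed depends on the capabilities and sysctls of the target environment, which this thread does not show.

```python
import errno
import socket


def bind_permitted(port: int, host: str = "0.0.0.0") -> bool:
    """Return False only if binding `port` is refused for permission reasons."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            # EACCES: the current user may not bind this (privileged) port.
            return False
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # The port is taken by another process, so permission was not the issue.
                return True
            raise
        return True
```

Running `bind_permitted(80)` as UID 65532 in the deployed container tells whether the services can keep port 80. If it returns False, moving uvicorn to a port at or above 1024 and remapping in the orchestrator is one option (an assumption about the deployment, not something confirmed here).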

Inline comments with details are posted on each PR above.
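
For the HEALTHCHECK finding, the Python service images could use a standard-library probe instead of curl or wget, which a minimal runtime image may not ship. This is a sketch under assumptions: `is_healthy` and `probe_exit_code` are hypothetical names, and the host and port in the usage note are placeholders; only the `/healthcheck` route name comes from the finding above.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts, and malformed URLs.
        return False


def probe_exit_code(url: str) -> int:
    """Map the probe result to the 0 (healthy) / 1 (unhealthy) exit codes a health check expects."""
    return 0 if is_healthy(url) else 1
```

A health check command would then call something like `sys.exit(probe_exit_code("http://127.0.0.1:<port>/healthcheck"))`, with the port matching whatever the service binds.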

Comment thread .drone.yml Outdated
Comment thread .drone.yml Outdated
Comment thread .drone.yml Outdated
Comment thread .drone.yml Outdated

andhreljaKern commented May 28, 2026

Contributor

Please do a "Find and replace" across all the affected repos, from:

base_image_registry: registry.dev.kern.ai
base_image_username:
  from_secret: docker_username
base_image_password:
  from_secret: docker_password

to

base_image_registry: dhi.io
base_image_username:
  from_secret: dockerhub_username
base_image_password:
  from_secret: dockerhub_password
  • resolved

P.S. We can also exclude the .drone.yml changes in the app repos, because they don't need dhi.io authentication.
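
The find-and-replace above could also be scripted so that indentation differences between repos do not matter. The following is a minimal sketch, not a tested migration tool: `swap_registry_block` is a hypothetical helper, and it assumes the five lines appear contiguously in each `.drone.yml`, exactly as quoted in this comment.

```python
from __future__ import annotations

import re

# Matches the old five-line block at any indentation level; group "i" is the
# indentation of the keys and group "j" the deeper indentation of the from_secret lines.
OLD_BLOCK = re.compile(
    r"^(?P<i>[ \t]*)base_image_registry:[ \t]*registry\.dev\.kern\.ai[ \t]*\n"
    r"(?P=i)base_image_username:[ \t]*\n"
    r"(?P<j>[ \t]+)from_secret:[ \t]*docker_username[ \t]*\n"
    r"(?P=i)base_image_password:[ \t]*\n"
    r"(?P=j)from_secret:[ \t]*docker_password[ \t]*$",
    re.MULTILINE,
)


def swap_registry_block(text: str) -> tuple[str, int]:
    """Replace every old registry block with the dhi.io one; return (new_text, count)."""

    def _render(match: re.Match) -> str:
        i, j = match.group("i"), match.group("j")
        return (
            f"{i}base_image_registry: dhi.io\n"
            f"{i}base_image_username:\n"
            f"{j}from_secret: dockerhub_username\n"
            f"{i}base_image_password:\n"
            f"{j}from_secret: dockerhub_password"
        )

    return OLD_BLOCK.subn(_render, text)
```

Reading each `.drone.yml`, passing its text through `swap_registry_block`, and writing it back only when the count is non-zero keeps the change idempotent, so it is safe to rerun across all affected repos.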


3 participants