feat(parser): read Claude/Codex sessions from S3-compatible object storage #650

Open
DanielMao1 wants to merge 2 commits into kenn-io:main from DanielMao1:feat/s3-session-source

DanielMao1 commented Jun 12, 2026

Summary

Lets claude_project_dirs / codex_sessions_dirs entries be s3:// URIs.
When a configured root is an s3:// URI, Claude and Codex sessions are listed
and fetched directly from an S3-compatible object store (AWS S3, MinIO, Aliyun
OSS, Cloudflare R2) and parsed into the local SQLite DB like any other source.
Pure Go via minio-go — no cgo, no change to the default build or
cross-compilation.

Motivation

I'm a heavy Claude Code (and Codex) user and a big fan of this project — it's
become how I actually review what my agents did. But my sessions are scattered:
some on my Mac, some on a personal EC2 box, a lot more on my company's GPU
cluster. What I really want is one agentsview that gives me an integrated
view and analytics across all of them, not a separate dashboard per machine.

sync --host and PG push already cover part of this, but both assume the central
instance can reach each machine: SSH needs inbound access, and PG push needs a
daemon + Postgres on every box. I can't SSH into the cluster nodes from home, and
ephemeral cloud instances are often gone by the time I'd want to pull from them.

So this goes the other direction — push, not pull. Each machine drops its own
~/.claude/projects / ~/.codex/sessions into its own prefix of a shared bucket
(a plain aws s3 sync / rclone cron), and one agentsview reads them all from
object storage. Nothing has to reach back into the source machines; they just
push before they disappear. It sits alongside sync --host and PG push rather
than replacing either.

How it works

The two touch points are discovery and reading bytes; the per-agent
parsers are untouched.

  • DiscoverClaudeProjects / DiscoverCodexSessions detect an s3:// root and
    list objects, reconstructing the same project / subagent layout the local
    walkers produce.
  • processS3Session downloads each object, buffers it to a transient temp file
    so the existing path-based parsers (incremental offsets, subagent paths) run
    unchanged, then deletes it — no persistent local mirror. The parsed session
    records the original s3:// URI as its source path.
  • Source machine is derived from a .../<machine>/raw/... layout, mirroring the
    host: prefix that SSH sync already attaches to pulled sessions.
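
As a rough illustration of the last bullet, the sketch below splits an s3:// URI into bucket and key and takes the path segment just before the first `raw` segment as the machine name. The names `parseS3URI` and `machineFromKey` and the example bucket layout are hypothetical, not taken from the PR; the actual logic in internal/parser/s3source.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// parseS3URI splits "s3://bucket/some/key" into bucket and key.
// Hypothetical helper: plain string handling is used instead of net/url
// because S3 keys may contain characters such as '?' or '#'.
func parseS3URI(uri string) (bucket, key string, err error) {
	rest, ok := strings.CutPrefix(uri, "s3://")
	if !ok {
		return "", "", fmt.Errorf("not an s3:// URI: %q", uri)
	}
	bucket, key, _ = strings.Cut(rest, "/")
	if bucket == "" {
		return "", "", fmt.Errorf("missing bucket in %q", uri)
	}
	return bucket, key, nil
}

// machineFromKey returns the segment immediately before the first "raw"
// segment, assuming a .../<machine>/raw/... layout. It returns "" when the
// key does not follow that layout.
func machineFromKey(key string) string {
	parts := strings.Split(key, "/")
	for i := 1; i < len(parts); i++ {
		if parts[i] == "raw" && parts[i-1] != "" {
			return parts[i-1]
		}
	}
	return ""
}

func main() {
	uri := "s3://my-bucket/sessions/gpu-node-7/raw/claude/projects/demo/abc.jsonl"
	bucket, key, err := parseS3URI(uri)
	if err != nil {
		panic(err)
	}
	fmt.Println(bucket, machineFromKey(key)) // prints: my-bucket gpu-node-7
}
```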

Credentials come from standard AWS env vars (AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_S3_ENDPOINT). AWS_S3_ENDPOINT
selects the S3-compatible endpoint (empty = AWS).
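
A minimal sketch of how these variables could be gathered, assuming an empty AWS_S3_ENDPOINT falls back to the public AWS endpoint `s3.amazonaws.com`. The `s3Settings` type, the `s3SettingsFromEnv` function, and the choice to treat missing keys as an error are illustrative assumptions, not the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// s3Settings is a hypothetical holder for the values the PR reads from
// the environment.
type s3Settings struct {
	AccessKey string
	SecretKey string
	Region    string
	Endpoint  string
}

// defaultAWSEndpoint is assumed here as the fallback when AWS_S3_ENDPOINT
// is empty ("empty = AWS" in the description).
const defaultAWSEndpoint = "s3.amazonaws.com"

// s3SettingsFromEnv reads the four variables through getenv so the lookup
// can be faked in tests. Rejecting missing keys is an assumption.
func s3SettingsFromEnv(getenv func(string) string) (s3Settings, error) {
	s := s3Settings{
		AccessKey: getenv("AWS_ACCESS_KEY_ID"),
		SecretKey: getenv("AWS_SECRET_ACCESS_KEY"),
		Region:    getenv("AWS_REGION"),
		Endpoint:  getenv("AWS_S3_ENDPOINT"),
	}
	if s.AccessKey == "" || s.SecretKey == "" {
		return s3Settings{}, errors.New("AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set")
	}
	if s.Endpoint == "" {
		s.Endpoint = defaultAWSEndpoint
	}
	return s, nil
}

func main() {
	s, err := s3SettingsFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("S3 source disabled:", err)
		return
	}
	// Only non-secret fields are printed.
	fmt.Println("endpoint:", s.Endpoint, "region:", s.Region)
}
```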

Where to look

  • internal/parser/s3source.go: the S3 client, listing, fetch, and discovery (new)
  • internal/parser/discovery.go: s3:// dispatch + DiscoveredFile.Machine
  • internal/sync/engine.go: processS3Session and the processFile short-circuit

Tradeoffs / limitations

  • Each object is read whole into memory (session JSONL is MB-scale, fine).
  • Every sync re-fetches all objects; no ETag/LastModified incremental yet — a
    natural follow-up using the metadata ListObjects already returns.
  • session export <id> is local-only and doesn't fetch from S3 yet.
  • Adds one dependency, github.com/minio/minio-go/v7.

Open question

Happy to adjust the shape. If you'd rather this be a dedicated [[remotes]]-style
config block (in the spirit of #412) instead of overloading the existing *_dirs
keys with s3://, I can rework it that way — wanted to keep the diff minimal
first and get your read on whether an object-store source is something you'd want
in tree at all.

…orage

When a claude_project_dirs / codex_sessions_dirs entry is an s3:// URI,
sessions are listed and fetched from object storage (AWS S3, MinIO, Aliyun
OSS, Cloudflare R2) via minio-go — pure Go, no cgo. Each object is
downloaded, buffered to a transient temp file so the existing path-based
parsers run unchanged, then removed; no persistent local mirror. The parsed
session records the original s3:// URI as its source path.

This is a push-based alternative to SSH remote sync (kenn-io#412): each machine
pushes its sessions to its own S3 prefix on its own schedule and a central
agentsview reads them, with no inbound SSH to each machine required. The
source machine is derived from the .../<machine>/raw/ path layout, mirroring
the host prefix SSH sync attaches.

Credentials come from standard AWS env vars (AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_S3_ENDPOINT).

roborev-ci Bot commented Jun 12, 2026

roborev: Combined Review (9d4977b)

S3 sync support has one High security issue and several Medium correctness/performance issues that should be fixed before merge.

High

  • internal/sync/engine.go:3378 - processS3Session derives a local temp path from the raw S3 object key basename and passes it to filepath.Join. S3 keys can contain backslashes, which are path separators on Windows, allowing an object key such as project/..\..\AppData\Roaming\somewhere\evil.jsonl to write outside the temp directory.

    Fix: Avoid using object-key text as a filesystem path component. Use os.CreateTemp(dir, "session-*.jsonl"), or strictly sanitize with forward-slash path.Base, rejecting "\", OS separators, drive/volume names, ".", and "..".
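
Both options in the suggested fix can be sketched with the standard library alone. The helper names `safeTempName` and `createSessionTemp` are hypothetical, and rejecting ":" is an extra assumption made here to cover Windows drive/volume prefixes.

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strings"
)

// safeTempName takes the forward-slash basename of an S3 object key and
// rejects anything that could still act as a path on some OS: ".", "..",
// "/", and names containing '/', '\' or ':'.
func safeTempName(key string) (string, error) {
	name := path.Base(key) // path (not filepath) always splits on '/'
	if name == "." || name == ".." || strings.ContainsAny(name, `/\:`) {
		return "", fmt.Errorf("unsafe object key basename %q", name)
	}
	return name, nil
}

// createSessionTemp avoids key text entirely: os.CreateTemp replaces the
// "*" in the pattern with a random string and creates the file inside dir.
func createSessionTemp(dir string) (*os.File, error) {
	return os.CreateTemp(dir, "session-*.jsonl")
}

func main() {
	for _, key := range []string{
		"project/abc.jsonl",
		`project/..\..\AppData\Roaming\somewhere\evil.jsonl`,
	} {
		name, err := safeTempName(key)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", name)
	}
}
```

The `os.CreateTemp` route is the simpler of the two, since no attacker-controlled text ever reaches the filesystem; sanitizing only matters if the parsers need the original basename.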

Medium

  • internal/sync/engine.go:3379 - S3 sessions are processed with temp-file metadata and temp paths, so stored file_path and mtime lookups never match on later syncs. Unchanged S3 objects will be fully parsed and reported as synced on every periodic sync.

    Fix: Carry S3 object metadata through discovery or StatObject, use the original s3:// path for DB skip/incremental lookups, and persist stable remote metadata such as mtime or an etag-derived hash.

  • internal/sync/engine.go:5730 - The new S3 path is persisted as file_path, but FindSourceFile only returns stored paths that pass os.Stat. Any s3:// session therefore cannot be found for single-session resync/source-mtime paths even though processFile can process S3 paths directly.

    Fix: Treat stored s3:// paths as valid source paths and route them through processS3Session, or add S3-aware FindSourceFunc handling.

  • internal/parser/s3source.go:88 - FetchS3Object reads the entire object into memory before writing it back to a temp file, which can spike memory or fail on large JSONL sessions.

    Fix: Stream the GetObject reader directly to the temp file with io.Copy, avoiding the intermediate []byte.


Panel: ci_default_security | Synthesis: codex, 14s | Members: codex_default (codex/default, done, 5m19s), codex_security (codex/security, done, 2m58s) | Total: 8m31s

…okup

- processS3Session derives the temp filename from path.Base and rejects names
  containing a path separator or "." / "..", so an S3 object key can never
  write outside the temp dir (e.g. a key embedding a backslash on Windows).
- FetchS3Object returns a streaming reader; processS3Session io.Copy's it
  straight to the temp file instead of buffering the whole object in memory.
- FindSourceFile returns stored s3:// paths as-is (no local file to stat) so
  single-session resync routes them back through processS3Session.

DanielMao1 (Author) commented Jun 12, 2026

The issues raised in the review are addressed in 598e6bd.

  • High (path traversal): Fixed. processS3Session now derives the temp filename from path.Base and rejects any name that contains a path separator or is "." or "..", so an object key can never escape the temp dir (including a key embedding a \ on Windows).
  • FetchS3Object buffers the whole object in memory: Fixed. It now returns the GetObject reader and processS3Session io.Copys it straight to the temp file — no intermediate []byte.
  • FindSourceFile can't resolve s3:// paths: Fixed. Stored s3:// paths are returned as-is (there's no local file to stat), so single-session resync routes them back through processS3Session.
  • No ETag/LastModified incremental — full re-parse every sync: Left as a follow-up for now (it's called out in the PR limitations). ListObjects already returns ETag + LastModified, so the plan is to carry that through discovery and skip unchanged objects against the stored hash/mtime. Happy to fold it into this PR if you'd rather it land together than as a follow-up — just say the word.

roborev-ci Bot commented Jun 12, 2026

roborev: Combined Review (598e6bd)

S3 support needs a couple of medium-severity fixes before merge.

Medium

  • internal/sync/engine.go:3400
    S3 sessions store the temporary file’s mtime as file_mtime, so unchanged objects get a fresh mtime on every sync. That bypasses skip detection and makes every S3 session look newly changed.
    Fix: Carry S3 object metadata from listing/stat, use that size/mtime for skip checks and stored session metadata, and look up skips by the s3:// URI.

  • internal/sync/engine.go:470
    Per-session sync by ID/path cannot refresh S3-backed sessions. service.Sync passes the stored s3:// file path to SyncPaths, but classifyOnePath returns before classifying non-local paths.
    Fix: Teach classifyOnePath/SyncPaths to recognize stored s3:// Claude/Codex paths, or route ID-based sync through SyncSingleSession.


Panel: ci_default_security | Synthesis: codex, 9s | Members: codex_default (codex/default, done, 5m47s), codex_security (codex/security, done, 1m57s) | Total: 7m53s
