A local-first, auditable knowledge platform with Graph-RAG retrieval, deterministic contradiction arbitration, an append-only audit log, and replayable knowledge state. Its one differentiator is Replayable RAG: the system's state at any past point can be reconstructed byte-identically from the audit log alone (reconstruct_graph_at(t)).

Built as a general mother platform through v1.0; domain packs (legal, food, retail, …) branch off only after v1.0 (see docs/PLATFORM_READINESS.md).
Korean README · For first-time users (even a 10-year-old can follow along)
A note on naming. SEKOS is the product brand (UI, marketing, product framing). JAMES is the internal codename for the reasoning engine inside it: the RAG + audit + lifecycle core. The repository, source comments, environment variables (JAMES_*), CLI flags (--sut james), the two benchmarks (RAB / LRB), and the citation BibTeX all keep the JAMES name because the published artifacts (Zenodo DOI, preprints, SPECs) were registered under it and must stay reproducible. Read SEKOS on the sidebar; read JAMES in grep -r '^# ' core/.
Most production RAG stacks today (LangChain, LlamaIndex, vanilla retrieval-augmented quickstarts) optimise for answer quality on a frozen corpus. SEKOS, running the JAMES reasoning engine, is built for the next two axes those frameworks leave unmeasured:
| Axis | LangChain / LlamaIndex / vanilla RAG | SEKOS (JAMES engine) |
|---|---|---|
| Audit-native lifecycle | logger.info() strings; no canonical event taxonomy; replay impossible from logs alone | Event-sourced audit_log schema; reconstruct_graph_at(t) replays system state byte-identically from the log alone; measured on RAB v0.1.1 (AC/RF/PC = 1.0 × 3 vs Baseline-0 default-logging floor = 0.275/0/0) |
| Time-valid retrieval | Latest version only; cannot answer "what was this contract's clause 6 months ago?" without an external versioned store | Per-document validity windows (T1) + supersede chain (T7); time-travel queries return the version valid at query_time; measured on LRB v0.2.3 (R@1 Vanilla < Naive < JAMES preserved across 4 models × 4 scales, JAMES − Naive gap > +0.10 throughout) |
| Local-first execution | Cloud-default (OpenAI / Anthropic API calls in every retrieval) | Runs on local Ollama (gemma4:e4b 4B to mixtral 47B); cloud is opt-in per query; data never leaves the host without explicit consent |
| EU AI Act 2026-08 alignment | "Compliance" is a TODO | RAB's 3 metrics map verbatim to Articles 10/12/19; the benchmark is the audit instrument the Act assumes exists |
What SEKOS does not claim:
- Better answer quality on closed-book QA: not better, but measured parity. Vanilla, Naive, and JAMES produce 4-decimal-identical EM/F1 on MuSiQue (gemma4:e4b / gemma3:12b / mixtral 47B), so JAMES adds no reasoning degradation. The closed-book score is the backbone model's capability; the validity-window mechanism is retrieval-side and orthogonal to closed-book reasoning. See LRB preprint §5.5.
- Novel architecture: ActiveGraph (arXiv:2605.21997) demonstrates the same event-sourced runtime class independently; the benchmark, not the runtime, is the contribution.
- Drop-in LangChain replacement: SEKOS is a platform with a different operational model (audit-first); migration is an integration project, not a one-line pip install.
If your use case is audit / lifecycle / time-travel / on-prem, SEKOS is built for it, measured for it, and citeable for it. If your use case is the fastest possible answer on a fixed corpus, use LangChain.
Looking for MRR / NDCG / RAGAS / hallucination-rate coverage? See docs/evaluation/v0.5-evaluation-coverage-mapping.md: a full mapping of standard RAG / IR metrics to what SEKOS measures (and what it deliberately doesn't), including code paths and procurement-ready answers.

Looking for a one-page comparison vs LangChain / LlamaIndex / Haystack / R2R / ActiveGraph? See docs/evaluation/v0.5-industry-comparison.md: three matrices (architectural capability presence / public benchmark headline coverage / reproducibility tier) with every SEKOS cell pinned to a committed artifact.
- Python 3.11+
- Ollama installed and running
- Min 16GB RAM (32GB+ recommended)
- (Optional) NVIDIA GPU for faster inference
- (Optional) Tavily API key for web search (free 1k/month)
git clone https://github.com/Hashevolution/James-RAG-Evol
cd James-RAG-Evol
# Configure environment
cp .env.example .env
# Edit .env: set JAMES_API_KEY, JAMES_JWT_SECRET
# Install dependencies
pip install -r requirements.txt
# Start the server (admin wizard auto-recommends a model on first login)
python server_llmwiki.py

Open http://localhost:8000/admin. The admin wizard measures your hardware and offers a one-click install of an appropriate Ollama model. Then open http://localhost:8000 for the chat UI.
Want to reproduce the published benchmark numbers instead of running the product? Jump to Reproduce in 60 seconds.
flowchart TB
classDef user fill:#fde7e7,stroke:#c33,color:#000
classDef sec fill:#fff7d6,stroke:#b8860b,color:#000
classDef pipe fill:#e0f2fe,stroke:#0369a1,color:#000
classDef life fill:#dcfce7,stroke:#15803d,color:#000
classDef store fill:#f3e8ff,stroke:#7e22ce,color:#000
classDef bench fill:#ffedd5,stroke:#c2410c,color:#000
USER[/"User query<br/>(REST / CLI / UI)"/]:::user
SEC["3-Stage Security<br/>RBAC + ABAC + Instruction Isolation<br/>(core/security/)"]:::sec
subgraph RETRIEVAL["Retrieval pipeline (core/)"]
direction TB
RETR["Hybrid Retrieval<br/>BM25 + dense embed (BAAI/bge-m3)<br/>core/retrieval/"]:::pipe
GRAPH["Graph-RAG ontology walk<br/>12 typed relations<br/>core/graph_engine.py"]:::pipe
REASON["Reasoning loop<br/>plan → retrieve → reflect → verify → synth<br/>core/reasoning/engine.py"]:::pipe
end
subgraph LIFECYCLE["Layer 4 Lifecycle (T1-T7)"]
direction TB
T1["T1 Temporal validity<br/>(valid_from, valid_to)"]:::life
T2["T2 Contradiction arbitration<br/>4-rule deterministic tree<br/>core/lifecycle/contradiction_arbiter.py"]:::life
T5["T5 Replayable Audit Graph<br/>reconstruct_graph_at(t)<br/>core/lifecycle/replay_graph.py"]:::life
T6["T6 Causality cascade<br/>invalidate_derived_facts"]:::life
T7["T7 Supersede chain<br/>supersede_by + supersede_at"]:::life
end
subgraph STORE["Storage (default local)"]
direction TB
CHROMA[("ChromaDB<br/>vector store")]:::store
WIKI[("wiki/<br/>doc + metadata")]:::store
AUDIT[("audit.db<br/>append-only audit log")]:::store
MEM[("memory/<br/>session state")]:::store
end
subgraph LLM["LLM backends (default local)"]
direction LR
OLLAMA["Ollama<br/>gemma4:e4b default"]
CLOUD["Cloud (opt-in)<br/>claude / openai / gemini"]
end
subgraph BENCHES["Pre-registered deterministic benchmarks (v0.4.3 / v0.4.4)"]
direction LR
RAB["RAB v0.1.1<br/>Audit Completeness / Replay Fidelity / Provenance Coverage<br/>EU AI Act Art. 10/12/19 anchor<br/>papers/rab-preprint/"]:::bench
LRB["LRB v0.2.3<br/>Temporal validity (query_time, valid_time)<br/>R@1 V<N<J × 4 model × 4 scale<br/>papers/lrb-preprint/"]:::bench
end
USER --> SEC
SEC --> RETR
RETR --> GRAPH
GRAPH --> REASON
REASON --> CHROMA
REASON --> WIKI
REASON --> AUDIT
REASON <--> LIFECYCLE
AUDIT --> T5
T1 --> T7
T2 --> T6
T7 --> T6
REASON --> OLLAMA
REASON -.opt-in.-> CLOUD
AUDIT -."scored by".-> RAB
GRAPH -."scored by".-> RAB
RETR -."scored by".-> LRB
LIFECYCLE -."scored by".-> LRB
The flow in one sentence: a user query passes through 3-stage security (RBAC + ABAC + instruction isolation), enters the retrieval pipeline (hybrid BM25 + dense embed → Graph-RAG ontology walk → reasoning loop), reads + writes the Layer 4 lifecycle store (T1-T7), and is replayable from the audit log via reconstruct_graph_at(t). RAB scores the audit log; LRB scores the retrieval quality on time-travel queries. Both benchmarks are external deterministic instruments: JAMES does not score itself.
[User Query]
    ↓
[Security Filter]       - injection patterns + PolicyEngine pre-check
    ↓
[Query Router]          - chat / coding / retrieval / web_search
    ↓
[Query Rewriter]        - LLM rewrite (opt-in, JAMES_ENABLE_QUERY_REWRITE)
    ↓
[Hybrid Search]         - Vector(60%) + BM25(20%) + keyword(10%) + name(10%)
    ↓
[Cross-Encoder Rerank]  - MiniLM-L-6-v2 (default ON; JAMES_DISABLE_RERANK=1 to disable)
    ↓
[Graph Engine]          - DFS + sources-aware + sensitivity gating
    ↓
[Reasoning Loop]        - retrieve → expand → reflect (opt-in) → verify (opt-in)
    ↓
[Tool Router]           - read tools direct; write tools → Change Request
    ↓
[Output Filter]         - PII masking + role-based filter
    ↓
[Answer + Reasoning Path + trace_id]
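The hybrid-search weights in the diagram (vector 60%, BM25 20%, keyword 10%, name 10%) can be read as a weighted sum over per-signal scores. The following is a minimal sketch under that assumption, with every signal already normalised to [0, 1]; the names `WEIGHTS`, `fuse_scores`, and `rank` are illustrative and not taken from the repository, and the engine may combine signals differently.

```python
# Weights copied from the pipeline diagram; treating them as a plain
# weighted sum is an assumption made for this sketch.
WEIGHTS = {"vector": 0.6, "bm25": 0.2, "keyword": 0.1, "name": 0.1}


def fuse_scores(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores; a missing signal counts as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def rank(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Return document ids sorted by fused score, best first."""
    return sorted(candidates, key=lambda doc: fuse_scores(candidates[doc]), reverse=True)


docs = {
    "doc_a": {"vector": 0.9, "bm25": 0.1},
    "doc_b": {"vector": 0.4, "bm25": 0.9, "keyword": 1.0, "name": 1.0},
}
print(rank(docs))
```

In this toy input, doc_b outranks doc_a despite a weaker vector match because the lexical and name signals add up, which is the behaviour a hybrid ranker is meant to allow.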
Every stage emits a row tied to one trace_id. scripts/replay_trace.py <trace_id> reconstructs the full sequence from audit_log. See docs/ARCHITECTURE.md §5.7 for the Cognitive Layer design.
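To make the trace_id idea concrete, here is a small sketch of an append-only log keyed by trace_id, using an in-memory SQLite database. The table layout, column names, and the `emit` / `replay` helpers are assumptions made for illustration; they are not the schema or the API of scripts/replay_trace.py.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the real audit_log columns are not documented here.
conn.execute(
    "CREATE TABLE audit_log ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, trace_id TEXT, stage TEXT, payload TEXT)"
)


def emit(trace_id: str, stage: str, payload: dict) -> None:
    """Append one row; this sketch never issues UPDATE or DELETE."""
    conn.execute(
        "INSERT INTO audit_log (trace_id, stage, payload) VALUES (?, ?, ?)",
        (trace_id, stage, json.dumps(payload, sort_keys=True)),
    )


def replay(trace_id: str) -> list[tuple[str, dict]]:
    """Return the stages of one trace in the order they were written."""
    rows = conn.execute(
        "SELECT stage, payload FROM audit_log WHERE trace_id = ? ORDER BY seq",
        (trace_id,),
    )
    return [(stage, json.loads(payload)) for stage, payload in rows]


emit("t-1", "retrieval", {"hits": 3})
emit("t-2", "retrieval", {"hits": 0})
emit("t-1", "rerank", {"kept": 2})
print(replay("t-1"))
```

Ordering by a monotonically increasing sequence number rather than by wall-clock time keeps the replay order stable even when two stages share a timestamp.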
James-RAG-Evol/
├── core/
│   ├── reasoning/          retrieval/reflection/verification/tool router
│   ├── retrieval/          hybrid search + cross-encoder reranker + query rewriter
│   ├── memory/             long-term memory (db / conversation / summaries)
│   ├── plugins/            plugin contract surface (Provider Protocol)
│   ├── policy_engine.py    single point of role/sensitivity decisions
│   ├── change_request.py   propose/review/approve write primitive
│   ├── cascade.py          file delete/modify → graph surgical update
│   ├── graph_editor.py     edge edit (replace/append/delete) + bidirectional sync
│   └── ...
├── eval/                   STEP 7 regression baseline + RAGAS suite
├── llm/                    LLM provider abstraction
├── tools/                  Capability-token gated tool modules
├── frontend/               Web UI (HTML + JS)
├── processors/             File preprocessing
├── wiki/                   Knowledge graph (markdown + sources)
├── memory/                 Long-term memory DB
├── workspace/              Change requests, patches, proposals
├── scripts/                bench.py / replay_trace.py / ops scripts
├── reports/                Eval results + promo assets
├── docs/                   ARCHITECTURE / PLATFORM_READINESS / ROADMAP / handovers
└── server_llmwiki.py       Main server entry point
Most RAG systems answer one question: "what's the answer?" SEKOS answers two more:
- What did the system know at time T? T7 supersede chains preserve historical fact states; reconstruct_view_at(t) returns the edge that was active at any past timestamp, even after unrelated CASCADE delete events.
- Why did the system say that? Every reasoning step (query rewrite, retrieval, rerank, planner, reflect, verify, synth) writes an append-only audit row. scripts/replay_trace.py <trace_id> reconstructs the full sequence byte-identically.

The two combined make SEKOS a Replayable RAG system, a category distinct from Agentic RAG (which optimises for what an AI can do) and from Mem0-style memory layers (which use an LLM judge to update beliefs). SEKOS updates beliefs via a deterministic 4-rule decision tree (core/lifecycle/contradiction_arbiter.py) that is LLM-free by design, and preserves both the old and the new fact for replay rather than overwriting.
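The "what did the system know at time T" question can be illustrated with validity windows plus a supersede pointer. This is a toy sketch: `FactVersion` and `version_at` are hypothetical names, the half-open window convention (valid_from inclusive, valid_to exclusive) is an assumption, and the real T1/T7 implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    fact_id: str
    value: str
    valid_from: int                     # inclusive, e.g. a Unix timestamp
    valid_to: Optional[int] = None      # exclusive; None means still valid
    supersede_by: Optional[str] = None  # id of the version that replaced this one


def version_at(versions: list[FactVersion], t: int) -> Optional[FactVersion]:
    """Return the version whose validity window contains t, if any."""
    for version in versions:
        if version.valid_from <= t and (version.valid_to is None or t < version.valid_to):
            return version
    return None


# The old version is kept alongside the new one instead of being overwritten.
clause = [
    FactVersion("c1", "30-day notice", valid_from=100, valid_to=200, supersede_by="c2"),
    FactVersion("c2", "60-day notice", valid_from=200),
]
print(version_at(clause, 150).value)
print(version_at(clause, 250).value)
```

Because the superseded version stays in the list, a query at t=150 still resolves to the older clause, while a query at t=250 resolves to its successor.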
- Deterministic memory lifecycle (v0.4.0): T1 Temporal Validity + T7 Supersede Chain + T2 Contradiction Arbitration. CASCADE (destructive, Layer 3) and EVENT (history-preserving, Layer 4) are guaranteed-separate paths, release-gated by tests/test_t7_release_gating_invariants.py against the real wiki fixture.
- Sources-aware Graph-RAG: 12 typed relations carry semantic meaning beyond embeddings, and every relation carries sources: [{doc_id, weight, role, ts}], so deleting or modifying a document surgically updates only the affected derived knowledge (Knowledge Cascade A-E, v0.3.0).
- Cognitive Layer: cross-encoder reranker (default ON), LLM query rewriter, reflection loop (draft → critique → revise), verification engine (security + fact check), and tool router. One trace_id reconstructs the full 8-stage reasoning sequence via scripts/replay_trace.py.
- PolicyEngine as a layer, not a sprinkle: single point of role / sensitivity decisions wired into retrieval, graph, output, and tools; removing it breaks 6+ modules (v0.2 Axis 4).
- Change Request primitive: every write (wiki edits, workspace jobs, self-evolution patches) routes through propose → review → admin approval → atomic apply → audit row. No silent writes.
- Self-evolution behind a human gate: feedback → candidate → bench eval → human approval → deploy → auto-rollback on regression. Every deployed patch has an approver_username audit row (v0.2 Axis 5).
- 100% local: runs on a laptop with Ollama; no cloud LLM dependency by default.

Each feature is regression-tested against the STEP 7 20-query baseline + RAGAS metrics. PRs touching core/{retrieval,graph,reasoning} cannot land without bench numbers.
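As an illustration of the sources-aware idea, the sketch below removes one document and touches only the edges that cited it. The `sources` field shape follows the README; the function name `cascade_delete`, the relation names, and the role values are invented for this example and are not the engine's actual cascade logic.

```python
def cascade_delete(edges: list[dict], doc_id: str) -> list[dict]:
    """Drop doc_id from every edge's sources; drop edges left with no source."""
    surviving = []
    for edge in edges:
        remaining = [src for src in edge["sources"] if src["doc_id"] != doc_id]
        if remaining:
            # Build a new edge so the input list is left untouched.
            surviving.append({**edge, "sources": remaining})
    return surviving


edges = [
    {"rel": "works_at", "sources": [
        {"doc_id": "d1", "weight": 1.0, "role": "primary", "ts": 1},
    ]},
    {"rel": "authored", "sources": [
        {"doc_id": "d1", "weight": 0.5, "role": "primary", "ts": 1},
        {"doc_id": "d2", "weight": 0.5, "role": "corroborating", "ts": 2},
    ]},
    {"rel": "located_in", "sources": [
        {"doc_id": "d3", "weight": 1.0, "role": "primary", "ts": 3},
    ]},
]

after = cascade_delete(edges, "d1")
print([edge["rel"] for edge in after])
```

The edge supported only by d1 disappears, the edge with a second source survives with that source alone, and the edge unrelated to d1 is unchanged.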
A single 4-cell ablation on the multihop_rag fixture (N=100, n=3 paired runs, on the 4B local model gemma4:e4b, git commit b686f35):
| Cell | path_coverage | graded_answer | abstention_f1 | token_cost | latency |
|---|---|---|---|---|---|
| C_minus (no RAG) | 0.000 | 0.213 | 0.356 | 675 | 9.8s |
| C_rag-basic (+ vector) | 0.000 | 0.260 | 0.306 | 783 | 12.5s |
| C_rag-graph (+ graph) | 0.4056 | 0.203 | 0.400 | 1675 | 32.4s |
| C_rag-ontology (+ typed filter) | 0.4056 | 0.230 | 0.4286 | 1695 | 32.4s |
Graph-RAG contribution (C_rag-basic → C_rag-graph):
- path_coverage +0.41 (load-bearing win, noise band 0.02): vector-only retrieval recovers 0% of gold supporting-doc paths on multi-hop queries; graph traversal recovers ~40%.
- abstention_f1 +0.094 (graph evidence improves "when to say I don't know" calibration).
- graded_answer −0.057 (honest loss: graph evidence adds noise to short-answer queries; typed-filter recovers +0.027).
- 2.1× token cost, 2.6× latency (the path-coverage win is not free).
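The deltas and ratios in this list follow directly from the ablation table. The snippet below simply redoes that arithmetic from the table values; the variable names are illustrative.

```python
# Values copied from the 4-cell ablation table.
cells = {
    "C_rag-basic": {"path_coverage": 0.000, "graded_answer": 0.260,
                    "abstention_f1": 0.306, "token_cost": 783, "latency_s": 12.5},
    "C_rag-graph": {"path_coverage": 0.4056, "graded_answer": 0.203,
                    "abstention_f1": 0.400, "token_cost": 1675, "latency_s": 32.4},
    "C_rag-ontology": {"path_coverage": 0.4056, "graded_answer": 0.230,
                       "abstention_f1": 0.4286, "token_cost": 1695, "latency_s": 32.4},
}
basic, graph, ontology = (cells[k] for k in ("C_rag-basic", "C_rag-graph", "C_rag-ontology"))

# Metric deltas for the basic -> graph step.
delta = {
    metric: round(graph[metric] - basic[metric], 3)
    for metric in ("path_coverage", "graded_answer", "abstention_f1")
}
token_ratio = round(graph["token_cost"] / basic["token_cost"], 1)
latency_ratio = round(graph["latency_s"] / basic["latency_s"], 1)
# How much graded_answer the typed filter wins back on top of the graph cell.
typed_recovery = round(ontology["graded_answer"] - graph["graded_answer"], 3)

print(delta, token_ratio, latency_ratio, typed_recovery)
```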
Cross-time reproducibility: an earlier measurement cycle (2026-06-01, single run) measured path_coverage 0.408; this rerun (2026-06-13, n=3 median) confirms 0.4056. Stable across 12 days of oracle revisions.
Full table + LRB architecture ablation + RAB audit ablation + honest negatives (closed-book QA, deep-multi-hop floor, cost trade-offs) all in docs/evaluation/v0.5-graph-rag-contribution.md.
Two benchmarks released as a sibling pair, both pre-registered before measurement, both deterministic-scorer-only (no LLM judge), both committed in this repository.
Evidence tiers used throughout this README: ★★★ = pattern holds across models, scales, and reruns (strongest); ★★ = confirmed in the tested scenario but absolute magnitude is scenario-sensitive; ★ = infrastructure exists / single-model or single-run only, not yet research-tier.
PDF (10 pages) · SPEC · Reproduce

RAB scores the exported audit-log artifact (Audit Completeness / Replay Fidelity / Provenance Coverage) of any RAG or agent system that can dump an append-only log. Three metrics map verbatim to EU AI Act Articles 10, 12, 19. Headline: the 4-SUT gap structure (Reference / JAMES audit-native / OpenTelemetry-GenAI bolt-on / vanilla default-logging), not JAMES's absolute score.

PDF (11 pages) · Reproduce

LRB scores temporal-validity (query_time, valid_time) retrieval quality across three deterministic scenarios (S1 quarterly, S2 yearly-with-time-travel, S3 publication-scale 1000 docs). Three systems-under-test (Vanilla append-only / Naive-supersede / JAMES validity-window) compared on 7 deterministic axes + 3 exploratory top-1 axes. Headline: Vanilla < Naive < JAMES on R@1 preserved across 4 model families × 4 scale points (12.5× scale span) with the JAMES − Naive gap > +0.10 throughout.
One command (wraps everything below; deterministic core tier, no GPU/Ollama, ~2 min):
git clone https://github.com/Hashevolution/James-RAG-Evol.git
cd James-RAG-Evol
python -m pip install -r requirements.txt
bash benchmarks/run_all.sh   # see benchmarks/README.md for --full / --with-llm

Or run each benchmark by hand:
# RAB scenario-S1 (deterministic; no LLM call; ~5 seconds)
python scripts/research/rab_run.py --sut reference # AC/RF/PC = 1.000/1.000/1.000 (gate)
python scripts/research/rab_run.py --sut james # AC/RF/PC = 1.000/1.000/1.000
python scripts/research/rab_run.py --sut baseline0 # AC/RF/PC = 0.275/0.000/0.000
# LRB Phase B (S2 time-travel) token-mode (deterministic; no LLM; ~30 seconds)
# Scenario fixtures are gitignored: build them first (deterministic, no LLM)
python scripts/research/build_lrb_scenario_s1.py
python scripts/research/build_lrb_scenario_s2.py
PYTHONPATH=. python scripts/research/lrb_run_phase_b.py --scenarios S1,S2
# → S2 R@1: Vanilla 0.225 < Naive 0.538 < JAMES 0.688 (JAMES − Naive gap +0.15)
# LRB S3 publication-scale (1000 docs / 5.6k events / 1000 queries; ~3 minutes)
python scripts/research/build_lrb_scenario_s3.py --scale publication
python scripts/research/lrb_run_s3.py --scale publication

Every result.json + bench.jsonl artifact in reports/external/lrb/ and reports/rab/ is SHA-pinned against the scenario fixture; byte-identical re-runs are the verification protocol. Full reproducibility disclosure + the community reproduction program live in benchmarks/.
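A reader who wants to check a re-run against a pinned digest could do something along these lines. This is a sketch only: it hashes a canonical JSON serialisation, whereas the repository's pinning may hash raw file bytes, and `artifact_digest` and the sample records are hypothetical.

```python
import hashlib
import json


def artifact_digest(result: dict) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two runs with the same content but different key order hash identically.
run_1 = {"sut": "james", "AC": 1.0, "RF": 1.0, "PC": 1.0}
run_2 = {"PC": 1.0, "RF": 1.0, "AC": 1.0, "sut": "james"}
drifted = {"sut": "james", "AC": 1.0, "RF": 0.9, "PC": 1.0}

print(artifact_digest(run_1) == artifact_digest(run_2))
print(artifact_digest(run_1) == artifact_digest(drifted))
```

Canonicalising before hashing means incidental differences such as key order do not break the comparison, while any change in a value does.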
@misc{seo2026jamesv044,
author = {Seo, Jiwon},
title = {{PROJECT JAMES} v0.4.4 (LRB v0.2.3 S3 publication-scale + cycle $\gamma$ 4-bench infrastructure closure)},
year = {2026},
month = {6},
doi = {10.5281/zenodo.20652679},
url = {https://doi.org/10.5281/zenodo.20652679},
version = {v0.4.4},
publisher = {Zenodo},
note = {Source: https://github.com/Hashevolution/James-RAG-Evol}
}
@misc{seo2026rab,
author = {Seo, Jiwon},
title = {{RAB}: A Replayable-Audit Benchmark for {RAG} and Agent Systems Operationalising {EU AI Act} Articles 10, 12, 19},
year = {2026},
howpublished = {Preprint v0.1.1},
url = {papers/rab-preprint/main.pdf},
note = {Data: \href{https://doi.org/10.5281/zenodo.20652679}{10.5281/zenodo.20652679}}
}
@misc{seo2026lrb,
author = {Seo, Jiwon},
title = {{LRB}: A Lifecycle Retrieval Benchmark for Temporal {RAG}},
year = {2026},
howpublished = {Preprint v0.2.3},
url = {papers/lrb-preprint/main.pdf},
note = {Data: \href{https://doi.org/10.5281/zenodo.20652679}{10.5281/zenodo.20652679}}
}

RAB v0.1.1 is a frozen benchmark spec + scenario fixture + deterministic scorer + adapter contract for systems that claim audit-replayable RAG / agent state. Three metrics, all deterministic (no LLM judge anywhere), each tied to a specific EU AI Act article:
| Metric | What it measures | EU AI Act anchor |
|---|---|---|
| AC: Audit Completeness | Are all required events present in the log? | Art. 12(1)/(2) |
| RF: Replay Fidelity | Can past state be reconstructed exactly from the log? | Art. 12(2)(b) post-market reconstruction |
| PC: Provenance Coverage | Is every fact traceable to its source? | Art. 10(2)(b) + W3C PROV |

Why a new benchmark: the Mathkar et al. 2026 agent-trace survey (arXiv:2606.04990) names "realistic execution-trace benchmarks" as an open challenge; RAB responds to that gap. The benchmark, not the audit-native runtime, is the contribution: ActiveGraph (arXiv:2605.21997) independently published the event-sourced log + replay architecture; RAB is what was missing.
Headline = the gap structure, not the JAMES engine's score (scenario-S1):
| SUT | AC | RF-exact | RF-graded | PC |
|---|---|---|---|---|
| reference (self-verify gate) | 1.000 | 1.000 | 1.000 | 1.000 |
| JAMES (audit-native) | 1.000 | 1.000 | 1.000 | 1.000 |
| Baseline-0 (vanilla quickstart + default logging) | 0.275 | 0.000 | 0.000 | 0.000 |
JAMES matching the reference on S1 is expected (SPEC §6.5). The audit-native vs default-logging delta is the finding. Evidence tier: ★★ (scenario-S1 confirmed). Not a regulatory certification: RAB operationalises the AI Act's concepts into measurable form; SPEC §6.3 says so wherever scores are published.
See eval/rab/SPEC-v0.1.md,
docs/handovers/v0.4-r1-4-gap-table-2026-06-10.md,
and docs/research/r1-4-preregistration-2026-06-10.md.
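To give a feel for what an exact replay-fidelity check involves, the sketch below folds an append-only event list into a set of edges as of time t and compares digests. It is a toy model, not the RAB scorer: the event shape, `replay_until`, and `digest` are assumptions made for illustration.

```python
import hashlib
import json


def replay_until(events: list[dict], t: int) -> set[tuple[str, str, str]]:
    """Fold add/remove events with ts <= t into the set of live edges."""
    state: set[tuple[str, str, str]] = set()
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["ts"] > t:
            break
        edge = tuple(event["edge"])
        if event["op"] == "add":
            state.add(edge)
        elif event["op"] == "remove":
            state.discard(edge)
    return state


def digest(state: set[tuple[str, str, str]]) -> str:
    """Order-independent fingerprint of a reconstructed state."""
    return hashlib.sha256(json.dumps(sorted(state)).encode("utf-8")).hexdigest()


log = [
    {"ts": 1, "op": "add", "edge": ["alice", "works_at", "acme"]},
    {"ts": 2, "op": "add", "edge": ["acme", "located_in", "berlin"]},
    {"ts": 3, "op": "remove", "edge": ["alice", "works_at", "acme"]},
]

state_at_2 = replay_until(log, 2)
state_at_3 = replay_until(log, 3)
print(len(state_at_2), len(state_at_3))
```

An exact-match fidelity check would then compare `digest(replay_until(log, t))` against the digest recorded for the same t; because the fold sorts by timestamp, the storage order of the events does not affect the result.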
The numbers below come from the current main branch: not aspirational, not from an older release. Every value is reproducible by cloning + running the listed command. (See the Evidence tiers legend for the ★ notation.)
| Surface | Verified | Where to check |
|---|---|---|
| Test suite | 4998 tests collected across tests/ (333 test files), all green on PR CI |
python -m pytest tests/ --collect-only -q |
| CASCADE / EVENT separation | Provable end-to-end via 5 release-gating invariants run against a real wiki fixture (not mocks) | tests/test_t7_release_gating_invariants.py |
| T6 causality cascade | 4 additional release-gating invariants pin foundational vs corroborative semantics | tests/test_t6_release_gating_invariants.py |
| Quality baseline (3-axis) | path_recall 1.00 / graded_answer 0.58 / abstention_f1 0.67 (median, post-calibration, n=3 paired reruns) | eval/qvt/baseline_2a31b20.json |
| STEP 7 regression | 20-query suite with gold_signals + abstention_truth + expected_path.nodes ground truth |
eval/regression/step7_queries.json (step7-v7) |
| Entity-anchor q15 fix | q15 ("David Soria Parraκ° λꡬμΌ?") path_recall 0.00 β 1.00 after JAMES_ENABLE_ENTITY_ANCHOR=1 + JAMES_EMBEDDING_MODEL=BAAI/bge-m3 + JAMES_ENABLE_QUERY_REWRITE=1 |
reports/research-runs/step7-bench-baseline-run*.json |
| Module size discipline | 20 KB cap enforced on every NEW core/ file (CLAUDE.md rule #5). Five legacy modules are grandfathered above the cap (largest: core/reasoning/reflect.py at 29.2 KB); split plans tracked in the v0.6 entry skeleton |
CLAUDE.md rule 5 + docs/handovers/v0.6-entry-skeleton-2026-06-13.md |
| Default-off invariant | Every routing layer added since v0.3 defaults OFF, plus the v0.5/v0.6 SaaS-readiness primitives (JAMES_TENANT_ID / JAMES_REQUIRE_TENANT_ID / JAMES_REQUIRE_APPROVAL_EVIDENCE / JAMES_OIDC_* / JAMES_CSP_USE_NONCE_*) β production fleets pulling current main see byte-identical retrieval to v0.4.4 unless they opt in |
JAMES_* env audit (CHANGELOG [0.5.0] section) |
| Deterministic contradiction arbitration | classify_contradiction is an LLM-free 4-rule decision tree (~10.2 KB pure function). Audit-replay-safe by construction. |
core/lifecycle/contradiction_arbiter.py |
| RAB v0.1.1 | JAMES AC/RF/PC = 1.000 / 1.000 / 1.000 vs Baseline-0 (vanilla quickstart + default logging) = 0.275 / 0.000 / 0.000 on scenario-S1. Deterministic scorer (no LLM judge); 3 metrics map to EU AI Act Art. 10/12/19 (applies from 2026-08-02). Tier ββ. | eval/rab/SPEC-v0.1.md + python scripts/research/rab_run.py --sut {reference,baseline0,james} |
| LRB v0.2.3 | R@1 Vanilla < Naive < JAMES preserved across a 4-point scale ladder (S2 N=80 β S3 publication N=1000, 12.5Γ scale) and across 4 model families (gemma4:e4b / gemma3:12b / mixtral / claude). S3 publication R@1: V/N/J = 0.502 / 0.721 / 0.845. JAMES β Naive gap > +0.10 at every scale point. Pattern + gap scale-robust βββ; absolute magnitude scenario-sensitive ββ. | papers/lrb-preprint/main.pdf + python scripts/research/lrb_run_s3.py --scale publication |
What is NOT yet headline-verified: a cross-model Graph-RAG ablation card. The Graph-RAG +0.41 path_coverage finding above is confirmed on the 4B local model (gemma4:e4b) only. The cross-model extension (a small + large model × 3 cells × n=3, ~14 h wall) is scaffolded as a one-command driver (scripts/research/graph_rag_synth_step2_cross_model.py) with pre-agreed interpretation rules locked in docs/evaluation/v0.5-graph-rag-contribution.md §3.2. Operator-launchable; awaits a free overnight GPU schedule.
Current state: v0.5 closed; the project is in a productive "v0.5 closed, v0.6 not yet entered" interval. Full per-release detail (and the post-close PR log) lives in CHANGELOG.md; the canonical state pointer is the v0.6 entry skeleton.

The v0.5 → v0.6 gate requires a ≥6-month external customer pilot, which is not yet cleared, so the project continues in mother-platform hardening mode: no domain-specific (vertical) features land until v1.0, per CLAUDE.md rule #1. v0.5 and the post-close consolidation shipped the Time-Travel Dashboard, multi-tenant / SaaS-readiness primitives, the ontology Pack SDK, CSP security-header middleware, and the Graph-RAG contribution measurement, all without touching core/retrieval / core/graph traversal / core/reasoning.
Release history (newest first; see CHANGELOG.md for full notes):
- v0.5 (2026-06-12): Time-Travel Dashboard + SaaS-readiness primitives + ontology Pack SDK + CSP nonce.
- v0.4.4 (2026-06-12): LRB v0.2.3 S3 publication-scale (12.5× scale ladder) + 4-benchmark measurement-infrastructure closure. Papers ready for submission (RAB, LRB); arXiv endorsement pending.
- v0.4.3 (2026-06-10): RAB v0.1.1, the first replayable-audit benchmark mapping AC/RF/PC to EU AI Act Art. 10/12/19; multi-hop arc closed with honest nulls.
- v0.4.2 (2026-06-06): T5 Replayable Audit Graph (reconstruct_graph_at(t)), the primitive whose quality RAB measures.
- v0.4.1 (2026-05-28): T6 causality cascade: removing a base fact's sources auto-invalidates derived edges (invalidate_derived_facts).
- v0.4.0 (2026-05-27): Layer 4 first bundle: T1 Temporal Validity + T7 Supersede Chain + T2 Contradiction Arbitration. CASCADE vs EVENT separation provable end-to-end.
- v0.3.0 (2026-05-17): Foundation Hardening: all six readiness axes green; second-user validation passed 2026-05-13.

- NOT production-ready: operational maturity (HTTPS / SSO / multi-tenancy / backup CLI) is a v1.0 deliverable; see SECURITY.md.
- Designed with security-first principles end to end.
- Open to collaboration: external contributors sign a one-click CLA on their first PR (see License).
SEKOS is not building one vertical. It is being hardened as a "mother platform" from which domain packs (legal, food, retail, travel, etc.) can branch off only at v1.0. Until then:
- No domain-specific features land in core/.
- Every change is graded against the same six-dimension readiness framework (architecture / extension API / eval contract / operational maturity / security boundary / production proof).
- The plugin contract that future packs will be built against is being designed and stress-tested.

See docs/PLATFORM_READINESS.md for the 6 dimensions, 4 gates (v0.2 / v0.3 / v0.4 / v1.0), and 3 branching forms (Domain Pack / Distribution / Vertical Product).
SEKOS treats security as a design principle, not a feature:
- 3-stage access control: Vector → Graph → Output
- RBAC (4 roles) + ABAC (4 sensitivity levels)
- Instruction isolation: separates commands from data
- JWT auth + rate limiting + full audit log
- Sandboxed execution (for tool calls)
Realistic note: synthetic-data testing differs from adversarial production testing. See SECURITY.md.
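The combination of role-based and sensitivity-based checks can be sketched as a clearance comparison applied at each of the three stages. The README states only that there are 4 roles and 4 sensitivity levels; every role name, level name, and function below is hypothetical and chosen for illustration.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    # Hypothetical names for the four sensitivity levels.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical names for the four roles, each mapped to its maximum clearance.
ROLE_CLEARANCE = {
    "guest": Sensitivity.PUBLIC,
    "member": Sensitivity.INTERNAL,
    "manager": Sensitivity.CONFIDENTIAL,
    "admin": Sensitivity.RESTRICTED,
}


def can_read(role: str, level: Sensitivity) -> bool:
    """Deny unknown roles; otherwise allow when clearance covers the level."""
    clearance = ROLE_CLEARANCE.get(role)
    return clearance is not None and level <= clearance


def filter_stage(role: str, items: list[dict]) -> list[dict]:
    """Apply the same check at one stage (vector hits, graph edges, or output)."""
    return [item for item in items if can_read(role, item["sensitivity"])]


hits = [
    {"id": "handbook", "sensitivity": Sensitivity.PUBLIC},
    {"id": "payroll", "sensitivity": Sensitivity.RESTRICTED},
]
print([hit["id"] for hit in filter_stage("member", hits)])
```

Running the same filter at the vector, graph, and output stages means a record that slips past one stage is still checked at the next.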
| Feature | Status |
|---|---|
| Hybrid Search (Vector + BM25 + keyword + name) | Working |
| Cross-encoder reranker (MiniLM-L-6-v2) | Working, default ON (v0.3) |
| LLM query rewriter | Opt-in (v0.3) |
| Sources-aware Graph-RAG (Knowledge Cascade A-E) | Working (v0.3) |
| PolicyEngine (RBAC + ABAC + capability tokens) | Working (v0.2 Axis 4) |
| Reflection loop (draft → critique → revise) | Opt-in (v0.3) |
| Verification engine (security + fact check) | Opt-in (v0.3) |
| Tool router (read direct, write → Change Request) | Working (v0.3) |
| Change Request primitive (wiki + jobs + patches) | Working (v0.2.x + v0.3) |
| Self-evolution (human approval + auto-rollback) | Working (v0.2 Axis 5) |
| Trace replay (one trace_id → full reasoning seq) | Working (v0.3) |
| Multimodal (image/video/audio + OCR-poison quarantine) | Working (v0.2 Axis 4) |
| Web search (Tavily / DuckDuckGo fallback) | Working |
| Multi-LLM routing (Ollama + Claude CLI backends) | Working |
| STEP 7 regression baseline + RAGAS | Working (v0.2 Axis 2) |
| Real-data validation (second-user gate) | Passed 2026-05-13 |
- Backend: FastAPI + Uvicorn
- LLM: Ollama (Gemma, DeepSeek-Coder, LLaVA)
- Vector DB: ChromaDB
- Embedding: Sentence-Transformers (MiniLM)
- Search: BM25 + Vector hybrid
- Web search: Tavily (primary) + DuckDuckGo (fallback)
- Auth: JWT (python-jose)
- Storage: SQLite + markdown wiki
See ROADMAP.md and docs/PLATFORM_READINESS.md.
Summary:
- v0.1: Core engine + scaffolding (released)
- v0.2: Foundation Hardening, 6 axes (closed 2026-05-13)
- v0.3: Platform Skeleton (Cognitive Layer + Knowledge Cascade + Change Request primitive; released 2026-05-17)
- v0.4 / v0.5: Benchmarks (RAB / LRB) + lifecycle + SaaS-readiness + Pack SDK (v0.5 closed 2026-06-12)
- v1.0: Production-Grade Mother (HTTPS / SSO / multi-tenancy / SOC2 readiness); external developers can publish their own packs
Multi-agent specialists, optional Neo4j backend, OpenAI-compatible API, streaming responses, and federation are speculative Beyond v1.0 work; see ROADMAP.md §Beyond v1.0.
Welcome! See CONTRIBUTING.md.
Priority areas:
- Documentation, examples, translations
- Bug fixes, test coverage
- New tool integrations and LLM provider support
Licensed under the MIT License. Use freely. See LICENSE.
External contributors sign a one-click Contributor License Agreement on their first pull request (CLA Assistant). One signature covers all future contributions to the project. See CONTRIBUTING.md for the full §License & CLA section, and docs/legal/non-cla-contributions.md for contribution paths that don't require signing.
A full inventory of third-party dependency licenses is available in THIRD_PARTY_LICENSES.md.
Inspired by:
- Microsoft GraphRAG
- LightRAG
- Graphiti
- Palantir-style ontology approaches
- Architectural direction, Platform Readiness gates, and roadmap framing are discussed with LEO, a continuing collaborator on this work
Use at your own risk. This is research code. No guarantees regarding sensitive-data handling or production security without further hardening.
