🧬 CodeDNA: An In-Source Communication Protocol for AI Coding Agents

An in-source communication protocol where the writing agent encodes architectural context and the reading agent decodes it. The file is the channel. Every fragment carries the whole.

License: MIT

Compatible with: Claude Code · Cursor · GitHub Copilot · Windsurf · OpenCode · Gemini · ChatGPT

CodeDNA is an inter-agent communication protocol implemented as in-source annotations. The writing agent embeds architectural context directly into source files; the reading agent decodes it at any point in the file. Like a hologram, cut in half: each piece still carries the complete image.

No RAG. No vector DB. No external rules files. Minimal drift (context co-located with code).

🎯 Less Prompt Engineering Needed: CodeDNA annotations help AI agents navigate the codebase with less manual guidance. Even less-technical users can get better multi-file fixes by describing the problem — the architectural context is already in the code.

CodeDNA Logo


The Problem

AI coding agents waste a significant fraction of their context window exploring irrelevant files, re-reading code, and missing cross-file constraints or reverse dependencies. The result: incomplete patches, higher token costs, and models that repeat the same mistakes across sessions.

The root cause is structural. Information like reverse dependencies and domain constraints cannot be inferred from a single file — it requires reading the whole codebase. Without a way to persist that knowledge, every agent starts from scratch.

CodeDNA embeds this context directly in source files: used_by: maps reverse dependencies, rules: encodes domain constraints, and agent: / message: accumulate knowledge across sessions. It is not intended to replace retrieval systems, vector databases, or external memory — it provides a persistent architectural context layer inside the repository that any of those systems can build on.

This also enables agent-to-agent communication: a constraint discovered by Agent A is available to Agent B in a different session or a different model. Knowledge compounds in a versioned, inspectable form.

Preliminary results are encouraging: +13pp F1 on Gemini 2.5 Flash and +9pp on DeepSeek Chat on Django tasks — zero-shot, no fine-tuning, just annotations. These numbers still require larger-scale validation.


Where CodeDNA sits in the AI memory stack

Every AI coding agent relies on multiple memory layers to navigate a codebase. Most of them are external to the code — chat history, vector databases, markdown rules files. CodeDNA is different: it is the only layer that lives inside the source files themselves.

CodeDNA Memory Layer Stack

| Layer | Examples | Where it lives | Shared across tools? |
| --- | --- | --- | --- |
| LLM / Agent | Claude, GPT-4, Cursor, Copilot | Cloud | — |
| External memory | Chat history, Projects, Memory API | Cloud / external DB | ✗ tool-specific |
| Native agent memory | Claude auto-memory, Cursor memory, Windsurf memories, Devin session memory, … | Local machine / tool cloud | ✗ tool-specific |
| RAG / Vector DB | Embeddings, Pinecone, pgvector | External infrastructure | depends |
| Markdown / Config | README, CLAUDE.md, .cursorrules, AGENTS.md | Repo (outside source files) | partial (tool-specific files) |
| CodeDNA | exports, rules, agent, message, .codedna | Inside every source file + repo root | ✅ always |

Every other layer is either external to the code or tool-specific. CodeDNA is the only memory that:

  1. Travels with the source file — through clones, forks, and CI pipelines, with no infrastructure dependency
  2. Is readable by any agent on any tool — Claude, Cursor, Windsurf, Copilot, OpenCode, or a custom script all see the same annotations

CodeDNA does not replace native agent memories — it is additive. Every agentic tool (Claude Code, Cursor, Windsurf, Devin, and any future agent) has its own native memory for user preferences, feedback, and tool-specific context. That context belongs outside the repo. CodeDNA handles the architectural context that belongs inside it. Use both.

This is what makes CodeDNA composable. RAG systems, vector databases, native tool memories, and external memory layers can all be built on top of or alongside CodeDNA annotations. The in-source layer is the shared foundation any of those systems can read from — and the only one that survives a git clone.


Semantic vs structural reasoning

AI coding agents usually begin from a semantic prompt and must infer structure by exploring the repository.
Without persistent architectural context, each session starts from scratch.

CodeDNA turns semantic reasoning into structural reasoning.

Annotations allow the agent to follow explicit dependency and constraint signals instead of relying only on token similarity or retrieval.

This suggests that source code alone may not be the optimal reasoning layer for AI agents. While binary is the lowest layer for execution, structured source + annotations may be closer to the lowest layer for understanding.

How it works — live benchmark data

CodeDNA Navigation Demo

Three visual metaphors, same real data (django__django-11808 · DeepSeek-Chat · 5 runs). Without CodeDNA: agent opens 2 random files and stops — 8/10 critical files missed. With CodeDNA: follows the used_by: chain — finds 6/10 critical files. Retry risk −52%. ▶ Interactive version — 3 metaphors

🔄 The Network Effect: When an AI agent writes CodeDNA annotations, it leaves a navigable trail for every other agent that reads the code after it — regardless of vendor or model. The more agents that participate, the more useful the protocol becomes.


🤔 Who is CodeDNA for?

| You are… | Without CodeDNA | With CodeDNA |
| --- | --- | --- |
| Non-technical user | Must learn prompt engineering to guide the AI agent through the codebase | Just describe the problem — annotations give the agent structural context to follow |
| Junior developer | AI finds the obvious file, misses the 5 related ones | used_by: graph helps surface related files that may need changes |
| Senior developer | Spends time writing detailed prompts every session | Writes annotations once — that context persists across sessions |
| Team lead | Each developer's AI makes different mistakes | Annotations encode team knowledge — potentially more consistent results |

The core idea: today, the quality of AI-assisted coding often depends on the user's ability to prompt. CodeDNA moves some of that knowledge from ephemeral prompts into persistent, version-controlled source code.


⚡ Quick Start

Setting up CodeDNA is two steps:

  1. Install the integration for your AI tool — tells the agent how to follow the protocol (Option 1 below)
  2. Annotate your existing codebase — adds CodeDNA headers to files already in your repo (Option 2 below)

For a new project, step 1 is enough — the agent annotates files as it creates and edits them. For an existing codebase, run both: step 1 first, then step 2 to bulk-annotate what's already there.

Want to try CodeDNA on a sample project or contribute to the codebase? See CONTRIBUTING.md for the dev setup.


Option 1 — AI Tool Integration (Claude Code, Cursor, Copilot, Windsurf, OpenCode)

Run one command for your tool:

bash <(curl -fsSL https://raw.githubusercontent.com/Larens94/codedna/main/integrations/install.sh) <tool>
| Tool | Option | Enforcement |
| --- | --- | --- |
| Claude Code | claude-hooks | ✅ Active — 4 hooks + .claude/settings.local.json |
| Cursor | cursor-hooks | ✅ Active — hook scripts in .cursor/hooks/ (v1.7+) |
| GitHub Copilot | copilot-hooks | ✅ Active — .github/hooks/hooks.json + scripts |
| Cline | cline-hooks | ✅ Active — hook scripts in .clinerules/hooks/ (v3.36+) |
| OpenCode | opencode | ✅ Active — JS plugin in .opencode/plugins/ |
| Windsurf | windsurf | ⚠️ Instructions only |
| Antigravity / custom agents | agents | ⚠️ Instructions only |
| Aider | claude | ⚠️ Instructions only |

Active enforcement = hooks validate annotations on every file write/edit automatically, regardless of session length or task complexity. Full reference: integrations/README.md.

The all option installs everything at once — only useful for teams where each developer uses a different tool.

Done. Your AI tool now follows the CodeDNA protocol. If you have existing files to annotate, continue with Option 2.
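As a rough illustration of what "active enforcement" means, a hook can be as small as a script that rejects writes to files missing a CodeDNA header. The sketch below is hypothetical — the real hook scripts live in the per-tool directories listed above — but it shows the shape of such a check:

```python
import re

# Fields a CodeDNA Level 1 module header is expected to carry (per the protocol).
REQUIRED_FIELDS = ("exports:", "used_by:", "rules:")

def missing_codedna_fields(source: str) -> list:
    """Return the required CodeDNA fields absent from a module's docstring."""
    match = re.match(r'\s*("""|\'\'\')(.*?)\1', source, re.DOTALL)
    if match is None:
        return list(REQUIRED_FIELDS)  # no module docstring at all
    header = match.group(2)
    return [field for field in REQUIRED_FIELDS if field not in header]

# A hook script would run this on every written file and exit non-zero
# (blocking the edit) when the returned list is non-empty.
annotated = '"""orders.py\n\nexports: f()\nused_by: a.py\nrules: none\n"""\n'
print(missing_codedna_fields(annotated))   # []
print(missing_codedna_fields("x = 1\n"))   # ['exports:', 'used_by:', 'rules:']
```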


Option 2 — CLI: annotate an existing codebase

Annotate an entire project from the terminal. Supports local models via Ollama at zero cost:

pip install git+https://github.com/Larens94/codedna.git

# Free — structural only, no AI
codedna init /path/to/project --no-llm

# Free — local model via Ollama
codedna init /path/to/project --model ollama/llama3

# Paid — Anthropic Haiku (~$1–3 for a Django project)
ANTHROPIC_API_KEY=sk-... codedna init /path/to/project --model claude-haiku-4-5-20251001

| Command | What it does |
| --- | --- |
| codedna init PATH | First-time annotation — L1 module headers + L2 function Rules: |
| codedna update PATH | Incremental — only unannotated files (safe to re-run) |
| codedna check PATH | Coverage report without modifying files |
| codedna init PATH --extensions ts go | Annotate TypeScript + Go files too (L1 only) |

Supported models via --model:

| Provider | Example | Cost |
| --- | --- | --- |
| Ollama (local) | ollama/llama3, ollama/mistral | Free |
| Anthropic | claude-haiku-4-5-20251001 | ~$1–3 / project |
| OpenAI | openai/gpt-4o-mini | Low |
| Google | gemini/gemini-2.0-flash | Low |
| None | --no-llm | Free |

Option 3 — Claude Code Plugin (coming soon)

Status: accepted by Anthropic, currently under review — not yet available in the public directory. Use Option 1 or 2 in the meantime.

Once available:

claude plugin install codedna

No API key. No extra cost. Uses your existing Claude subscription. Adds /codedna:init, /codedna:check, /codedna:manifest, /codedna:impact commands + four enforcement hooks.


📊 Benchmark — SWE-bench Multi-Model Results

5 real Django issues from SWE-bench, tested across multiple LLMs. Same prompt, same tools, same tasks. Only difference: CodeDNA annotations.

Metric: File Localization F1 — harmonic mean of recall and precision on files read vs ground truth. Isolates the navigation bottleneck that precedes code generation.
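Concretely, File Localization F1 can be computed from the set of files the agent opened and the set of files in the ground-truth patch. A small sketch (the benchmark script linked below is the authoritative implementation):

```python
def file_localization_f1(files_read: set, ground_truth: set) -> float:
    """Harmonic mean of precision and recall on files read vs the gold patch."""
    if not files_read or not ground_truth:
        return 0.0
    hits = len(files_read & ground_truth)
    precision = hits / len(files_read)   # fraction of opened files that mattered
    recall = hits / len(ground_truth)    # fraction of relevant files actually opened
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: the agent opened 3 files, 2 of which are in the 4-file gold patch.
print(round(file_localization_f1({"a.py", "b.py", "c.py"},
                                 {"b.py", "c.py", "d.py", "e.py"}), 3))  # 0.571
```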

Statistical test: Wilcoxon signed-rank test (one-tailed, H1: CodeDNA > Control) over F1 pairs across 5 tasks. N=5 with ≥5 runs per task at T=0.1.

| Model | Ctrl F1 | DNA F1 | Δ F1 | p-value | Tasks won |
| --- | --- | --- | --- | --- | --- |
| Gemini 2.5 Flash | 60% | 72% | +13% | 0.040* | 4/5 |
| DeepSeek Chat | 50% | 60% | +9% | 0.11 | 4/5 |
| Gemini 2.5 Pro | 60% | 69% | +9% | 0.11 | 3/5 |

3 of 3 models complete. Full data: benchmark_agent/runs/

Gemini 2.5 Flash: W+=14, N=5, p=0.040 ✅ significant. DeepSeek Chat: W+=12, N=5, p=0.11. Gemini 2.5 Pro: W+=12, N=5, p=0.11. All runs: 5 tasks × 3–5 runs at T=0.1.

When CodeDNA Helps Most

Empirical analysis across 5 tasks (Gemini 2.5 Flash, ≥5 runs each) suggests a pattern:

| Task type | Example | Δ F1 |
| --- | --- | --- |
| Clear dependency chain — A calls B which delegates to C | dbshell → client → subprocess (12508) | +9% |
| Delegation with backend fan-out — one interface, N backends | Trunc → ops.date_trunc_sql (13495) | +21% |
| Feature addition with flag gating — new capability across feature/schema layers | INCLUDE clause in Index (11991) | +17% |
| XOR feature with multi-layer propagation | Q() XOR support (14480) | +18% |
| Cross-cutting fix — same pattern in N unrelated files, no shared ancestor | __eq__ NotImplemented (11808) | ~0% |

Per-task breakdown

| Task | What it is | Why hard without CodeDNA | Δ F1 (Flash / DeepSeek) |
| --- | --- | --- | --- |
| 12508 dbshell | Add -c SQL flag to dbshell management command | Entry point is obvious by name; 4 backend runshell_db() clients are hidden | +9% / +1% |
| 11991 INCLUDE | Add INCLUDE clause support to Index | schema.py is findable; 4 backend schema editors are not | +17% / +6% |
| 14480 Q() XOR | Add XOR operator to Q() and QuerySet() | ORM → SQL → backends cascade requires touching 7 files | +18% / +14% |
| 13495 Trunc tzinfo | Fix timezone handling in TruncDay() for non-DateTimeField | Per-backend date_trunc_sql() override not reachable by grep alone | +22% / −8% ⚠️ |
| 11808 __eq__ | Fix __eq__ to return NotImplemented for unknown types | Entry is models/base.py (847 lines, generic name); 5 subclasses are unconnected | ≈0% / +34% |

⚠️ Task 13495 shows a model-dependent anomaly: Flash benefits strongly (+22pp) while DeepSeek and Pro regress (−8/−9pp). Under investigation.

Transparency note on 11808: the cross-cutting task was included deliberately to test the limits of the protocol. The benchmark annotations do not pre-populate a list of affected files — the agent must discover them independently. CodeDNA v0.7 shows Δ ≈ 0% on this task type. This is reported as a known limitation, not hidden. See SPEC.md §2.4 for the proposed v0.8 extension (cross_cutting_patterns:) and why it would not constitute cheating.

CodeDNA is most effective when there is a navigable call chain. The used_by: graph guides the agent from entry point to all affected files. For cross-cutting concerns (same fix in many independent files with no shared ancestor), the benefit is smaller because there is no natural navigation path to follow.

Annotation Integrity

A full audit confirmed no task-specific hints are embedded in the codedna/ files. Where GT files appear in used_by: targets, it is because those files are genuine callers or subclasses — not cherry-picked. The cross-cutting task (11808, Δ≈0%) confirms this: annotations described the architecture accurately but gave no navigation advantage because there is no call chain to follow.

One correction was made during the audit: base/schema.py in task 11991 initially listed only postgresql/schema.py in used_by: — updated to include all 4 backend schema editors that genuinely inherit from it.

Full audit: benchmark_agent/claude_code_challenge/django__django-13495/BENCHMARK_RESULTS.md

Pattern: cheaper models appear to benefit most. Flash (cheapest of the three) shows the strongest gain (p=0.040). This suggests annotating once may allow cheaper models to perform closer to more expensive ones — though the sample is small.

Full data: benchmark_agent/runs/ · Script: benchmark_agent/swebench/run_agent_multi.py


🤝 Multi-Agent Team Experiments

The SWE-bench benchmark above tests single-agent file navigation. Here we test a different question: can CodeDNA help teams of agents divide work without collisions and produce integrated software?

Two experiments, both using 5-agent teams orchestrated with Agno (TeamMode.coordinate). Same task, same model, same tools — only the instructions differ.

| Metric | Exp 1 — RPG (DeepSeek Chat) | Exp 2 — SaaS (DeepSeek R1) |
| --- | --- | --- |
| Duration (A / B) | 1h 59m / 3h 11m (1.6× faster) | 82.6m / 99m (17% faster) |
| Output quality | Playable game / static scene | Lower complexity (2.1 vs 3.1) |
| Annotation adoption | 94% | 98.2% (spontaneous, no reminders) |
| message: adoption | 0 (not in prompt) | 54 files (100%, organic) |
| Judge fixes needed | 8 / 12 | — |

Full reports: Exp 1 report · Exp 2 data

Experiment 1 — 2D RPG Game (run_20260329_234232)

Setup: identical 5-agent team (GameDirector → GameEngineer → GraphicsSpecialist → GameplayDesigner → DataArchitect), same task, same model (DeepSeek deepseek-chat), same tool budget. Only the instructions differed.

| Metric | Condition A — CodeDNA | Condition B — Standard |
| --- | --- | --- |
| Total duration | 1h 59m | 3h 11m |
| Python files | 50 | 45 |
| Total LOC | 10,194 | 14,096 |
| Avg LOC/file | 203 | 313 |
| Annotation coverage | 94% | 0% |
| Judge fixes to boot | 8 | 12 |
| Player controllable after fixes | Yes (WASD) | No |

CodeDNA was 1.60× faster. More importantly: after judge intervention to fix both outputs, condition A produced a playable game (ECS running, 5 entities, WASD input). Condition B produced a visible but static scene — engine/ecs.py and gameplay/systems/player_system.py were both correct, but the integration layer connecting them was never written.

The director centralization cascade

Without used_by: contracts, the director spent 25 minutes occupying all four module namespaces before delegating (vs 12 minutes with CodeDNA). Every downstream specialist inherited structure they didn't design:

B Director builds full scaffold (25m — 2.0× A)
  → GameEngineer reverse-engineers structure (36m — 3.9× A)
    → GraphicsSpecialist works around pre-built renderer (41m — 1.4× A)
      → GameplayDesigner inherits 545-line monolith (35m — 2.6× A)
        → DataArchitect — independent domain, cleanest run (35m — 0.75× A ← only exception)

The cascade peaks at the agent nearest to the director's territorial decisions and diminishes toward the most independent domain. used_by: forces ownership upfront — the director cannot occupy a module it declared as belonging to another agent.

Condition B's bugs were structurally different

All 8 fixes in condition A were corrections to existing code. Condition B had 12 fixes — 4 on existing code and 8 missing modules: entity_system.py, physics_engine.py, ai_system.py, player_controller.py, and the entire integration/ directory. These modules were declared by the director in game_state.py but never written by anyone. Writing them from scratch would be outside the scope of judge intervention.

More LOC does not mean more coverage. B produced 38% more lines (14,096 vs 10,194) but 10% fewer files. Average file size: 313 lines vs 203. More code, less functionality.

Full report: experiments/runs/run_20260329_234232/REPORT.md · Run data: experiments/runs/run_20260329_234232/

Experiment 2 — AgentHub SaaS webapp A/B test (run_20260331_002754)

Setup: same 5-agent team, same task (build AgentHub — a multi-tenant SaaS platform to rent, configure and deploy AI agents), upgraded model: DeepSeek R1 (deepseek-reasoner). Two conditions run sequentially on the same machine.

| Metric | Condition A — CodeDNA | Condition B — Standard |
| --- | --- | --- |
| Duration | 82.6 min | 99.0 min |
| Python files | 55 | 50 |
| Total LOC | 14,156 | 11,872 |
| Avg function length | 14.3 lines | 26.2 lines |
| Avg cyclomatic complexity | 2.11 | 3.07 |
| Max function complexity | 10 | 16 |
| Classes | 90 | 50 |
| Annotation coverage | 98.2% | 0% |
| Syntax errors | 1 | 0 |
| Validation score | 0.73 | 0.87 |

The single syntax error in condition A was an em-dash character (— U+2014) introduced inside a rules: annotation field. Without it, validation scores would be near-equal. The gap does not reflect a systematic correctness difference.

98.2% adoption — spontaneous and sustained

DeepSeek R1 annotated 54 of 55 files with all 5 CodeDNA fields (exports, used_by, rules, agent, message) across a full 83-minute multi-agent session — without any prompting mid-run to "remember annotations." This is the highest adoption rate observed across all experiments.

Example — app/agents/agent_wrapper.py (written by the AgentIntegrator specialist):

"""app/agents/agent_wrapper.py β€” Wraps agno.Agent, counts tokens, enforces credit cap.

exports: AgentWrapper, CreditExhaustedError
used_by: app/agents/agent_runner.py β†’ run_agent_stream,
         app/services/agno_integration.py β†’ agent execution
rules:   Never call agno.Agent directly from API layer β€” always go through AgentWrapper
         Token count must be extracted from agno response metadata and stored in agent run tokens_used
         AgentWrapper must raise CreditExhaustedError (HTTP 402) before starting if balance < min_credits
         All agent instructions must be sanitised (strip HTML, limit to 10k chars)
agent:   AgentIntegrator | 2024-12-05 | implemented AgentWrapper with token counting and credit cap
         message: "implement tool usage tracking and cost estimation"
"""

The rules: field encodes four constraints (API layer isolation, token tracking, credit pre-check, input sanitization) that cannot be inferred by reading the file alone — they require knowing the full call chain. The message: field leaves a forward-planning note for the next agent in the session.

Level 2 annotations — function-level Rules

The same file shows L2 adoption inside the class body:

class AgentWrapper:
    """Wraps an agno.Agent instance with token counting and credit enforcement.

    Rules:
        1. Token counting is extracted from agno response metadata
        2. Credit cap is enforced before execution
        3. Instructions are sanitized (HTML stripped, length limited)
        4. All agent interactions go through this wrapper
    """

message: as inter-agent forward planning

The field was used consistently across all 54 annotated files to encode work that the agent knew was needed but was out of scope for its current task:

# app/agents/agent_runner.py
agent:   AgentIntegrator | 2024-12-05 | implemented agent runner with streaming and credit management
         message: "implement concurrent execution with asyncio semaphore"

# app/agents/memory_manager.py
agent:   AgentIntegrator | 2024-12-05 | implemented persistent memory with similarity search
         message: "implement memory summarization for long conversations"

# app/services/scheduler_service.py
agent:   Product Architect | 2024-03-30 | created scheduler service skeleton
         message: "implement job persistence for fault tolerance across restarts"

# app/services/agent_service.py
agent:   Product Architect | 2024-03-30 | created agent service skeleton
         message: "implement agent configuration validation against Agno framework schema"

These are not instructions the agent received — they are observations it left for itself (and for future agents), co-located with the code where the work would eventually happen. No agent was told to use message: this way.

What the unconstrained condition built

Condition B (no CodeDNA) produced working code but with a notable structural anomaly: the agent started Flask, then pivoted to FastAPI mid-session, leaving both stacks in the codebase simultaneously.

  • app/__init__.py imports Flask, SQLAlchemy, JWTManager, Bcrypt, Celery β€” initializes db = SQLAlchemy()
  • app/main.py creates a FastAPI application via create_fastapi_app()
  • run.py calls create_app() with a Flask-style app.run()
  • Jinja2 templates (base.html, home.html, marketplace.html) and static JS files are residue from the Flask phase

The pivot is not a bug in the usual sense — condition B's individual files are syntactically correct (0 errors). But the integration layer is inconsistent. CodeDNA's rules: and used_by: fields force the agent to declare architectural boundaries upfront, which appears to reduce mid-session pivots.

B went deeper on domain logic

Despite the architectural inconsistency, condition B fully implemented modules that A left as stubs:

  • app/billing/credit_engine.py (413 LOC) β€” complete CreditEngine with debit(), credit(), reserve(), release(), transaction logging, InsufficientCreditsError
  • app/memory/manager.py (638 LOC) β€” MemoryManager with vector similarity search, importance scoring, TTL expiry
  • demo_seed.py β€” realistic seed data (A had none)
  • test_app.py β€” basic test file (A had none)

A built stronger architecture (ServiceContainer DI, 9 exception types, async SQLAlchemy); B built more domain implementation. Neither was production-ready without further work.

Summary

| Question | Answer |
| --- | --- |
| Does a reasoning model adopt CodeDNA spontaneously? | Yes — 98.2% across 54 files, sustained over 83 min |
| Does CodeDNA change code structure? | Yes — lower complexity (2.11 vs 3.07), shorter functions (14 vs 26 lines), more classes (90 vs 50) |
| Does it prevent bugs? | No — the one syntax error was inside an annotation field |
| Does message: get used as designed? | Yes — 54 files, organically, without explicit instruction |
| Does it prevent mid-session architectural pivots? | Likely yes — B changed stack mid-session; A did not |

N=1 per condition. Results are directional, not statistically powered. The experiment is presented as a qualitative case study to complement the SWE-bench navigation benchmark.

Full run data: experiments/runs/run_20260331_002754/ · Script: experiments/run_experiment_webapp2.py

Limitations

Both multi-agent experiments are N=1 per condition — results are directional, not statistically powered. Experiment 2 used sequential runs on shared hardware (machine state may differ between conditions). Task 13495 shows an unexplained model-dependent anomaly (Flash +22pp, DeepSeek −8pp). Independent replication across different models, team sizes, and project types is needed.


Fix Quality — Claude Code Manual Session

The SWE-bench benchmark measures file navigation (did the agent open the right files?). This second benchmark measures fix completeness (did the agent produce the correct patch?).

Setup: two Claude Code sessions on django__django-13495, same model (claude-sonnet-4-6), same prompt, same bug. Ground truth: the official Django patch (7 files).

Bug: TruncDay('created_at', output_field=DateField(), tzinfo=tz_kyiv)
     generates SQL without AT TIME ZONE — timezone param silently ignored.

Results:

| Metric | Control | CodeDNA |
| --- | --- | --- |
| Session time | ~10–11 min | ~8 min |
| Total interactions (estimated) | ~33 | ~30 |
| Failed edits | 5 | 0 |
| Files matching official patch | 6 / 7 | 7 / 7 |
| date_trunc_sql fixed (DateField) | ✅ all backends | ✅ all backends |
| time_trunc_sql fixed (TimeField) | ❌ not touched | ✅ all backends |
| sqlite3/base.py updated | ❌ | ✅ |
| SQLite approach matches official patch | ❌ | ✅ |
| Knowledge left for next agent | ❌ | ✅ rules: + agent: updated |

What made the difference: a single rules: annotation on TimezoneMixin.get_tzname():

def get_tzname(self):
    """
    Rules: Timezone conversion must occur BEFORE applying datetime functions;
           database stores UTC but results must reflect input datetime's timezone.
    """

This described an architectural principle, not the bug. The control saw the same time_trunc_sql call on the line immediately below the reported bug — and didn't touch it. CodeDNA read the constraint and applied the fix to the full pattern.

Validity note: this is a single run, not a statistically powered study. The result is presented as an illustrative case, not a population estimate. The causal mechanism is traceable: one annotation changed the frame from "fix DateField" to "fix the timezone pattern across all output fields."

Full report: benchmark_agent/claude_code_challenge/django__django-13495/BENCHMARK_RESULTS.md · Session logs: control · codedna · Reproduce: HOW_TO_RERUN.md

Run it yourself:

  1. Clone the control repository:
    git clone https://github.com/Larens94/codedna-challenge-control
  2. Clone the CodeDNA-annotated version:
    git clone https://github.com/Larens94/codedna-challenge-codedna
  3. Open either repository in your AI coding agent (Claude Code, Cursor, etc.)
  4. Paste the same prompt into your agent and score how many of the 7 patch files it touches.

Quick test with the CLI:

# Check annotation coverage
codedna check ./codedna-challenge-codedna

# Run a dry-run annotation (no LLM)
codedna init ./codedna-challenge-codedna --no-llm --dry-run

πŸ—ΊοΈ Roadmap

CodeDNA v0.8 is the current release. The planned development path:

| Milestone | Goal | Status |
| --- | --- | --- |
| M1 — Protocol & CLI | v0.8 spec · codedna init/update/check · AST-based auto-extraction · message: agent chat layer | ✅ Done |
| M2 — Benchmark Expansion | 20+ SWE-bench tasks · 5+ LLMs · Zenodo dataset · public dashboard | 🔜 |
| M3 — Multi-Tool Hooks | Active enforcement hooks for Claude Code · Cursor · Copilot · Cline · OpenCode — validates on every write | ✅ Done |
| M4 — Language Extension | 11 languages: Python · TS/JS · Go · PHP (Laravel) · Rust · Java · Kotlin · Ruby · C# · Swift · Blade/Jinja2/Vue | ✅ Done |
| M5 — Editor & Workflow | VS Code extension (used_by graph · agent timeline · model heatmap) · GitHub Action CI | 🔜 |
| M6 — Research & Dissemination | arXiv preprint · ICSE NIER/workshop submission · annotate Flask, FastAPI | 🔜 |

This roadmap is part of a funding application to NLnet NGI0 Commons Fund (deadline April 1st 2026). If you find CodeDNA useful and want to support its development, ⭐ the repo and share it.


🔬 v0.8 Features

message: — Persistent Agent Chat in Code

The agent: field records what an agent did. The message: sub-field (new in v0.8) adds a conversational layer — soft observations, open questions, and forward-looking notes left directly for the next agent.

"""analytics/revenue.py β€” Monthly/annual revenue aggregation.

...
agent:   claude-sonnet-4-6 | anthropic | 2026-03-10 | Implemented monthly_revenue.
         message: "rounding edge case in multi-currency β€” investigate before next release"
agent:   gemini-2.5-pro    | google    | 2026-03-18 | Added annual_summary.
         message: "@prev: confirmed, promoted to rules:. New: timezone rollover in January"
"""

message: works at both levels:

  • Level 1 (module docstring) β€” for agents that read the full file
  • Level 2 (function docstring) β€” for agents using a sliding window that never sees the top of the file

The lifecycle: an observation left in message: either gets promoted to rules: (architectural truth confirmed) or dismissed with a reply. Append-only, never deleted.

Agent Telemetry via Git Trailers

Git is already immutable, append-only, and diff-complete. v0.8 uses git trailers — the same standard as Co-Authored-By:, natively recognised by GitHub — to embed AI session metadata directly in commit messages:

implement monthly revenue aggregation

AI-Agent:    claude-sonnet-4-6
AI-Provider: anthropic
AI-Session:  s_a1b2c3
AI-Visited:  analytics/revenue.py, payments/models.py, api/reports.py
AI-Message:  found rounding edge case in multi-currency — investigate before next release

Git already records the diff, date, and changed files. AI-Visited: is the only addition — files read during the session, which git does not track natively.

This gives you audit queries immediately:

git log --grep="AI-Agent:"                          # all AI commits
git log --grep="AI-Agent: claude" -p -- revenue.py  # claude's changes to a file
git log --format="%b" | grep "AI-Agent:" | sort | uniq -c  # model distribution

Three-tier architecture: git (authoritative audit, full diff) ↔ .codedna (lean session summary for agent navigation) ↔ file agent: field (one-liner, sliding-window safe). A session_id links all three.
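Because trailers follow the standard key: value format at the end of the commit body, the telemetry is trivially machine-readable. A hypothetical parser sketch (the trailer names match the example above; any tooling around them is an assumption, not part of the spec):

```python
def parse_ai_trailers(commit_message: str) -> dict:
    """Extract AI-* git trailers (key: value lines) from a commit message body."""
    trailers = {}
    for line in commit_message.splitlines():
        # Trailer lines start with the key and a colon, e.g. "AI-Agent: ..."
        if line.startswith("AI-") and ":" in line:
            key, _, value = line.partition(":")
            trailers[key.strip()] = value.strip()
    return trailers

message = """implement monthly revenue aggregation

AI-Agent:    claude-sonnet-4-6
AI-Session:  s_a1b2c3
AI-Visited:  analytics/revenue.py, payments/models.py
"""
print(parse_ai_trailers(message)["AI-Agent"])                    # claude-sonnet-4-6
print(parse_ai_trailers(message)["AI-Visited"].split(", "))      # the files read
```

In a real pipeline the same extraction is available from git itself via `git interpret-trailers --parse`.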

VS Code Extension (planned, M5)

Built on top of git log with AI trailers:

  • CodeLens β€” last AI agent + commit count inline on every file and function
  • File heatmap β€” how many AI sessions touched each file, by provider
  • Agent Timeline β€” chronological session log with git diff per session
  • Stats panel β€” model distribution chart, navigation efficiency per model

Full spec: SPEC.md Β§4.7–4.8 Β· VSCode extension is planned for M3.


🧬 The Four Levels

Level 0 — Project Manifest .codedna (The view from far away)

A single YAML file at the repo root. The agent reads this first — before opening any source file — to understand packages, their purposes, and inter-package dependencies.

# .codedna — auto-generated by codedna init
project: myapp
packages:
  payments/:
    purpose: "Invoice generation, payment processing"
  analytics/:
    purpose: "Revenue reports, KPI dashboards"
    depends_on: [payments/, tenants/]
  tenants/:
    purpose: "Multi-tenant management, suspension"

Level 1 — Module Header (The view from close up: ~50 tokens)

A docstring at the top of every file. Only includes information that cannot be inferred from the code: the public API (exports:), who depends on this file (used_by:), and domain constraints (rules:). Import statements already declare dependencies — no need to duplicate them.

"""orders/orders.py — Order lifecycle management.

exports: get_active_orders() -> list[dict] | create_order(user_id, items) -> None
used_by: analytics/revenue.py → get_revenue_rows
rules:   User system uses soft delete — NEVER return orders for users
         where users.deleted_at IS NOT NULL. Always JOIN on users.
"""

Level 2 β€” Function-Level Rules (The view from very close)

Rules: docstrings on critical functions, written organically by agents as they discover constraints. Each agent that fixes a bug or learns something important leaves a Rules: for the next agent β€” knowledge accumulates over time.

def get_active_orders() -> list[dict]:
    """Return all non-cancelled orders for active (non-deleted) users.

    Rules:   MUST JOIN users and filter deleted_at before returning results.
             Failure to filter inflates revenue reports with deleted-user orders.
    """

Level 3 β€” Semantic Naming (Cognitive compression)

Variable names encode type, shape, domain, and origin. Any 10-line extract is self-documenting.

# ❌ Standard β€” agent must trace the entire call chain
data  = get_users()
price = request.json["price"]

# βœ… CodeDNA β€” readable in any context window
list_dict_users_from_db  = get_users()
int_cents_price_from_req = request.json["price"]
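The convention can even be linted. A toy sketch (the prefix list below is an illustrative assumption, not part of the spec) that checks whether a name leads with a type token:

```python
# Illustrative prefixes only -- the spec does not mandate a fixed list
TYPE_PREFIXES = ("int", "str", "float", "bool", "list", "dict", "set")

def follows_semantic_naming(name: str) -> bool:
    """True if the name leads with a type token, e.g. list_dict_users_from_db."""
    head = name.split("_", 1)[0]
    return head in TYPE_PREFIXES

print(follows_semantic_naming("list_dict_users_from_db"))  # True
print(follows_semantic_naming("data"))                     # False
```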

Planner Read Protocol

To plan edits across 10+ files: read .codedna first, then read only the module docstring of each file (first 8–12 lines), build an exports: β†’ used_by: graph, then open only the relevant files in full.
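The protocol above can be sketched directly. A minimal illustration (not the shipped tool) that reads only the first lines of each file and collects used_by: edges for the planning graph:

```python
import re
from pathlib import Path

def header_edges(root: str, head_lines: int = 12) -> list[tuple[str, str]]:
    """Scan module headers only: return (file, used_by target) edges.

    Reads at most `head_lines` lines per file, so planning cost stays
    proportional to the number of files, not their size.
    """
    edges = []
    for path in Path(root).rglob("*.py"):
        head = "\n".join(path.read_text().splitlines()[:head_lines])
        for target in re.findall(r"used_by:\s*(\S+)", head):
            edges.append((path.name, target))
    return edges
```

From these edges the agent can decide which files a change will ripple into before opening any of them in full.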


🎯 Annotation Design Principle β€” Architecture, Not Answers

The key rule for rules: annotations is: describe the mechanism, not the solution.

# ❌ Wrong β€” gives away the answer
rules:   Fix mysql/operations.py, oracle/operations.py, postgresql/operations.py

# βœ… Correct β€” describes the delegation chain
rules:   Trunc.as_sql() delegates to connection.ops.date_trunc_sql() and
         time_trunc_sql(). Each backend implements these independently.

used_by: is a navigation map, not a to-do list. The agent reasons about which targets are relevant to the current task and opens only those. In the benchmark, CodeDNA runs showed a read precision of P=100% (zero wasted file reads) on the tasks measured, while control runs scattered reads across irrelevant files.


πŸ”„ Inter-Agent Knowledge Accumulation

CodeDNA is designed for multi-agent environments β€” different models, different tools, different sessions. Each agent leaves knowledge for the next:

Agent A fixes a bug β†’ adds Rules: "MUST filter soft-deleted users"
Agent B reads Rules: β†’ avoids the same bug without re-discovering it
Agent C discovers an edge case β†’ extends the Rules:

Unlike docs (which go stale), Rules: annotations are co-located with the code β€” read every time the function is edited.

Current benchmark results are zero-shot β€” no fine-tuning on the protocol. Models follow used_by: and rules: by general language understanding alone. A fine-tuned model could potentially treat these as native structured signals, which might reduce variance further β€” this remains to be tested.

See SPEC.md for the full inter-agent model, verification protocol, fine-tuning potential, and training corpus design.


🌐 Language Support

CodeDNA v0.8 supports 11 languages. Python is the reference implementation with full AST-based extraction (L1 module headers + L2 function Rules:). All other languages get L1-only annotation via regex adapters β€” no external toolchain required.

Language Extensions L1 L2 Framework awareness
Python .py βœ… AST βœ… AST β€”
TypeScript / JavaScript .ts .tsx .js .jsx .mjs βœ… β€” β€”
Go .go βœ… β€” β€”
PHP .php βœ… β€” Laravel (Route facades, Eloquent) Β· Phalcon (Controller/Model, DI, Router)
Rust .rs βœ… β€” β€”
Java .java βœ… β€” β€”
Kotlin .kt .kts βœ… β€” β€”
C# .cs βœ… β€” β€”
Swift .swift βœ… β€” β€”
Ruby .rb βœ… β€” β€”

Template engines (L1 via block-comment extraction):

Template Extensions Comment syntax
Blade (Laravel) .blade.php {{-- --}}
Jinja2 / Twig .j2 .jinja2 .twig {# #}
Volt (Phalcon) .volt {# #}
ERB / EJS .erb .ejs <%# %>
Handlebars / Mustache .hbs .mustache {{!-- --}}
Razor / Cshtml .cshtml .razor @* *@
Vue SFC / Svelte .vue .svelte <!-- -->

Pass --extensions to annotate non-Python files:

codedna init ./src --extensions ts go              # TypeScript + Go
codedna init ./app --extensions php                # PHP/Laravel or PHP/Phalcon
codedna init ./templates --extensions volt blade   # Phalcon Volt + Laravel Blade
codedna init . --extensions ts go php rs java      # mixed project
codedna check . --extensions ts go -v              # coverage report

PHP + Laravel example

<?php
// app/Http/Controllers/UserController.php β€” Handles user CRUD endpoints.
//
// exports: UserController::index() -> Response
//          UserController::store(Request) -> JsonResponse
// used_by: routes/web.php -> Route::resource('users', UserController::class)
// rules:   must extend App\Http\Controllers\Controller.
//          all public methods are auto-detected as exports.
// agent:   claude-sonnet-4-6 | anthropic | 2026-04-02 | s_20260402_001 | initial controller scaffold

PHP + Phalcon example

<?php
// app/controllers/UserController.php β€” Handles user CRUD in Phalcon MVC.
//
// exports: UserController::indexAction() -> Response
//          UserController::createAction() -> Response
//          route:/users
//          service:userService
// used_by: app/config/router.php -> $router->addGet('/users', ...)
// rules:   extends Phalcon\Mvc\Controller β€” do not add constructor, use DI.
//          $di->set('userService', ...) registers this service globally.
// agent:   claude-sonnet-4-6 | anthropic | 2026-04-02 | s_20260402_001 | initial Phalcon controller

namespace App\Controllers;

use Phalcon\Mvc\Controller;

class UserController extends Controller
{
    public function indexAction() { ... }
    public function createAction() { ... }
}

The PHP adapter auto-detects:

  • extends Controller / extends Model / extends Phalcon\Mvc\Controller β†’ marks as Phalcon component
  • $router->addGet('/uri', ...) β†’ exports as route:/uri
  • $di->set('serviceName', ...) / $di->setShared(...) β†’ exports as service:serviceName
  • Public methods β†’ annotated as ClassName::method
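The route detection can be illustrated with a regex. A simplified sketch (not the adapter's real implementation) that turns `$router->addGet(...)` calls into exports-style entries:

```python
import re

# Matches $router->addGet('/uri', ...) and $router->addPost('/uri', ...)
ROUTE_RE = re.compile(r"\$router->add(?:Get|Post)\(\s*'([^']+)'")

def detect_routes(php_source: str) -> list[str]:
    """Return exports-style route entries found in PHP source."""
    return [f"route:{uri}" for uri in ROUTE_RE.findall(php_source)]

php = "$router->addGet('/users', ['controller' => 'user']);"
print(detect_routes(php))  # ['route:/users']
```

Because the adapters are regex-based, adding a new framework pattern is a matter of extending the pattern table, with no external toolchain required.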

πŸ“ Repository Structure

codedna/
β”œβ”€β”€ README.md               ← you are here
β”œβ”€β”€ QUICKSTART.md           ← 2-minute setup for every AI tool
β”œβ”€β”€ SPEC.md                 ← full technical specification v0.8
β”œβ”€β”€ integrations/
β”‚   β”œβ”€β”€ CLAUDE.md               ← Claude Code system prompt
β”‚   β”œβ”€β”€ .cursorrules             ← Cursor / Windsurf rules file
β”‚   β”œβ”€β”€ .windsurfrules           ← Windsurf rules file
β”‚   β”œβ”€β”€ .clinerules              ← Cline rules file
β”‚   β”œβ”€β”€ copilot-instructions.md ← GitHub Copilot instructions
β”‚   └── install.sh              ← one-line installer for all tools
β”œβ”€β”€ codedna_tool/           ← installable CLI package (codedna init/update/check)
β”‚   β”œβ”€β”€ cli.py
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── languages/          ← per-language annotation adapters
β”œβ”€β”€ codedna-plugin/         ← Claude Code plugin (pending review)
β”œβ”€β”€ benchmark_agent/
β”‚   β”œβ”€β”€ swebench/
β”‚   β”‚   β”œβ”€β”€ run_agent_multi.py          ← multi-model benchmark (5 providers)
β”‚   β”‚   └── analyze_multi.py            ← multi-model comparison
β”‚   β”œβ”€β”€ claude_code_challenge/          ← fix-quality benchmark (control vs CodeDNA)
β”‚   β”‚   └── django__django-13495/
β”‚   └── runs/                           ← results by model
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ python/             ← annotated Python example
β”‚   β”œβ”€β”€ python-api/         ← annotated Flask/FastAPI example
β”‚   β”œβ”€β”€ typescript-api/     ← annotated TypeScript example
β”‚   β”œβ”€β”€ go-api/             ← annotated Go example
β”‚   β”œβ”€β”€ java-service/       ← annotated Java example
β”‚   β”œβ”€β”€ rust-cli/           ← annotated Rust example
β”‚   β”œβ”€β”€ php-laravel/        ← annotated Laravel example
β”‚   └── ruby-sinatra/       ← annotated Ruby/Sinatra example
β”œβ”€β”€ paper/                  ← scientific paper (arXiv preprint)
β”‚   β”œβ”€β”€ codedna_paper.pdf
β”‚   β”œβ”€β”€ codedna_paper.html
β”‚   β”œβ”€β”€ codedna_whitepaper_EN.html
β”‚   └── codedna_paper_IT.html
└── tools/
    β”œβ”€β”€ pre-commit              ← CodeDNA v0.8 pre-commit hook (validates staged files)
    β”œβ”€β”€ install-hooks.sh        ← installer: copies pre-commit into .git/hooks/
    β”œβ”€β”€ validate_manifests.py   ← deep annotation validator (format, agent dates, purpose length)
    β”œβ”€β”€ agent_history.py        ← session history viewer (reads AI git trailers)
    β”œβ”€β”€ traces_to_training.py   ← SFT/DPO/PRM dataset converter from benchmark runs
    └── extract_city_data.py    ← extract annotations to JSON for city visualization

πŸ’¬ A note from the author

This is my first paper. I'm not a researcher β€” I'm a developer who is genuinely passionate about AI and how it interacts with code.

I built CodeDNA because I kept running into the same problem: AI agents making mistakes not because they were wrong, but because they had no context. I wondered: what if the context was already in the file? What if every snippet the agent read was self-sufficient?

I'm sharing this with complete humility. The benchmark is real, the data is reproducible, and the spec is open. Maybe it's useful to you. Maybe it sparks a better idea. Either way, I hope it contributes something.

If you find it helpful, try it, break it, improve it β€” or just tell me what you think. Feedback from people who actually use it is the only way this gets better.

If CodeDNA saved you some context tokens, a coffee is always welcome: ko-fi.com/codedna

β€” Fabrizio


Contributing

See CONTRIBUTING.md. Examples in any language are welcome.

License

MIT

About

A lightweight annotation standard that helps AI agents navigate codebases faster, with fewer file reads and tool calls
