test #30

Closed
LucasErcolano wants to merge 28 commits into main from merge/main-feature-aggregation

@LucasErcolano

Owner

test

Joacocade and others added 28 commits May 22, 2026 16:40
Resolves issue #20.

- Add memory_mode feature flag:
  baseline | experimental, env/YAML driven, rollback-safe.
- Add experiment runner:
  deterministic run_id, seed control, snapshot config,
  seed/prompt hashes, results.json export, runs/<case>/<variant>/<seed>/ layout.
- Add docs and configs:
  docs/memory_experimental.md, docs/experiment_harness.md,
  configs/memory_baseline.yaml, configs/memory_experimental.yaml,
  configs/experiments/example_case.yaml,
  configs/experiments/v1_smoke_*.yaml incl. no-report smoke variant.
- Add tests:
  backend/tests/test_memory_mode.py,
  backend/tests/test_experiment_runner.py,
  backend/tests/test_experiment_runner_memory.py.
- Update backend services/tests for experimental memory integration,
  spike baseline/rollback behavior, memory metrics logging, and
  safe backend logger handling.
- Update .gitignore for logs/runs/artifacts.
- Final pre-merge cleanup: move temporary smoke/log artifacts out of tree;
  preserve no-report smoke config for simulation path validation.

Issue: #20
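
The deterministic run_id and the runs/<case>/<variant>/<seed>/ layout described above could be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names (stable_hash, make_run_id, run_dir) and the 12-character hash length are assumptions.

```python
import hashlib
import json
from pathlib import Path


def stable_hash(obj) -> str:
    """Hash any JSON-serializable object; sort_keys makes dict key order irrelevant."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def make_run_id(case: str, variant: str, seed: int, config: dict, prompt: str) -> str:
    """Same inputs always give the same run_id: no timestamps, no randomness."""
    return stable_hash({
        "case": case,
        "variant": variant,
        "seed": seed,
        "config_hash": stable_hash(config),  # hypothetical field name
        "prompt_hash": stable_hash(prompt),  # hypothetical field name
    })


def run_dir(root: Path, case: str, variant: str, seed: int) -> Path:
    """Mirror the runs/<case>/<variant>/<seed>/ layout from the commit message."""
    return root / case / variant / str(seed)
```

Deriving the id only from hashed inputs means a rerun with the same case, variant, seed, config, and prompt lands in the same directory, which makes baseline and experimental runs directly comparable.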
Add optional wiki audit context layer to ReportAgent that compiles
simulation knowledge-base pages into structured context injected into
planning and section-generation prompts. Feature is fully opt-in via
build_wiki_context_for_report()/wiki_context=None — no change to
existing behavior when not activated.

Implementation:
- backend/app/services/wiki_memory/: new package (WikiStore,
  WikiCompiler, schemas, templates) for compiling wiki pages into
  context for report generation
- backend/app/services/report_agent.py: add wiki_context param,
  inject <wiki_audit_context> block into plan_outline and
  generate_section_react prompts with prior-knowledge labeling
- backend/app/api/report.py: integrate wiki context building with
  graceful degradation (non-fatal on error)
- backend/app/services/__init__.py: refactor to lazy-import heavy
  services, eager-export wiki_memory public API

Tests: 116/116 passing (compiler, store, integration, smoke).
Docs: docs/wiki_backed_report_memory.md with MVP activation details.
Smoke: scripts/real_lite_smoke.py for real-LLM verification.
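
A rough sketch of the opt-in injection and the non-fatal fallback described above. Only the <wiki_audit_context> tag and the wiki_context=None default come from the commit message; the helper names inject_wiki_context and safe_build_context, and the wording of the prior-knowledge label, are assumptions.

```python
from typing import Callable, Optional


def inject_wiki_context(prompt: str, wiki_context: Optional[str] = None) -> str:
    """Return the prompt unchanged unless wiki context is supplied (opt-in)."""
    if not wiki_context:
        return prompt
    block = (
        "<wiki_audit_context>\n"
        # Hypothetical label text; the real prompt wording is not in the source.
        "Prior knowledge compiled from wiki pages; treat as auxiliary evidence.\n"
        f"{wiki_context.strip()}\n"
        "</wiki_audit_context>"
    )
    return f"{block}\n\n{prompt}"


def safe_build_context(builder: Callable[[], str]) -> Optional[str]:
    """Graceful degradation: a failing builder yields None instead of raising."""
    try:
        return builder()
    except Exception:
        return None
```

With wiki_context left at None the prompt is byte-for-byte identical to the input, which is what keeps baseline behavior unchanged.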
Route OASIS simulation agents to different LLMs via a YAML model map and
record per-call telemetry (tokens, latency, estimated cost) so every agent
action is traceable to the model that produced it. Fully opt-in via
--model-map; single-model behavior is unchanged without it.

- model_router.py: load/validate model map, resolve ModelPolicy per agent
  (precedence by_agent_id > by_role > default), lazy CAMEL backend build.
  Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun) — not
  LLMClient, which is not in the agent LLM path — writing one JSONL record
  per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends,
  redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/
  recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets,
  cost, telemetry wrapper).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
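
The precedence rule (by_agent_id > by_role > default) and the rejection of literal secrets could look roughly like this. The map keys by_agent_id, by_role, and default, and the ModelPolicy name, come from the commit message; the entry fields model and api_key_env are assumed for illustration and may differ from the real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPolicy:
    model: str
    api_key_env: str  # name of an environment variable, never the secret itself


def validate_entry(entry: dict) -> ModelPolicy:
    """Reject literal secrets so keys can only arrive through the environment."""
    if "api_key" in entry:
        raise ValueError("literal api_key is not allowed; use api_key_env")
    return ModelPolicy(model=entry["model"], api_key_env=entry["api_key_env"])


def resolve_policy(model_map: dict, agent_id: int, role: str) -> ModelPolicy:
    """Most specific match wins: agent id, then role, then the default entry."""
    by_id = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_id:
        return validate_entry(by_id[agent_id])
    if role in by_role:
        return validate_entry(by_role[role])
    return validate_entry(model_map["default"])
```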
…LM telemetry (#21)

Supersedes the spike's inline agent_configs llm_* routing with the
configurable agent_model_map.yaml router + per-call telemetry, as the
spike itself called for. Spike evidence docs are preserved.

# Conflicts:
#	backend/scripts/run_reddit_simulation.py
- run_parallel_simulation.py is single-model per platform (not wired):
  concurrent platforms make a shared sink.current_round racy; full wiring
  needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one
  telemetry row per top-level call (final usage or final error).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
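
To illustrate what instrumenting a backend instance means here, the sketch below wraps run() on a single object and emits exactly one JSONL row per top-level call, whether it succeeds or fails. It is not the code in llm_telemetry.py: EchoBackend is a stand-in class, the record fields are assumptions, and arun() is left out for brevity.

```python
import io
import json
import time
from typing import TextIO


def instrument_instance(backend, sink: TextIO, model_name: str) -> None:
    """Wrap run() on this one object; the class and other instances are untouched."""
    original_run = backend.run

    def run_with_telemetry(*args, **kwargs):
        start = time.perf_counter()
        record = {"model": model_name, "error": None}
        try:
            return original_run(*args, **kwargs)
        except Exception as exc:
            record["error"] = type(exc).__name__
            raise
        finally:
            # Runs once per top-level call; retries inside original_run are invisible.
            record["latency_s"] = round(time.perf_counter() - start, 6)
            sink.write(json.dumps(record) + "\n")

    backend.run = run_with_telemetry  # instance attribute shadows the class method


class EchoBackend:
    """Stand-in for a real model backend, for illustration only."""

    def run(self, messages):
        return {"echo": messages}


if __name__ == "__main__":
    sink = io.StringIO()
    backend = EchoBackend()
    instrument_instance(backend, sink, "model-a")
    backend.run(["hello"])
    print(sink.getvalue(), end="")
```

Because the wrapper sits above any retry logic inside the wrapped call, it cannot count individual retries, which matches the limitation stated above.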
…#21)

Closes the issue's 'Smoke run con 2 modelos reales' ("smoke run with 2 real models") checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite
  (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in
  llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint)
  alongside the original local-vLLM recipe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
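
A per-model summary such as "18 LLM calls, 9 per model" can be derived from the telemetry JSONL with the standard library alone. The sketch below assumes each row carries model, total_tokens, and cost_usd fields; those names are illustrative and may not match the actual llm_telemetry.jsonl schema.

```python
import json
from collections import defaultdict


def summarize(jsonl_text: str) -> dict:
    """Aggregate call count, tokens, and cost per model from JSONL telemetry rows."""
    totals = defaultdict(lambda: {"calls": 0, "total_tokens": 0, "cost_usd": 0.0})
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        row = json.loads(line)
        agg = totals[row["model"]]
        agg["calls"] += 1
        agg["total_tokens"] += row.get("total_tokens", 0)
        agg["cost_usd"] += row.get("cost_usd", 0.0)
    return dict(totals)
```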
 field coverage (#21)

Addresses PR #14 review by @LucasErcolano:
- Canonical config file section: agent_model_map.yaml (runtime) vs
  configs/model_map_example.yaml (template) vs smoke evidence maps.
- Smoke run section now states the real 2-model run was executed
  (Gemini, no GPU) and is the final S2 evidence — fixes the stale
  "deferred" wording that contradicted README.md.
- Telemetry: explicit Issue #21 required-field coverage table,
  retries documented as stable (SDK-internal, not a separate field).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + cascading fallback) with Fusion patch

- Replace backend/app/utils/llm_client.py with pr-600 version (15bd114) on top of
  pr-318 refactor (52c177f): cleaner facade, _chat_raw internal helper, _clean_json_response
  for markdown fence stripping, repair_truncated_json module-level helper, cascading
  fallback to LLM_BOOST_* when primary LLM fails.
- Re-adapt Fusion patch: inject extra_body={'plugins':[{id:'fusion',...}]} when
  model ends in '/fusion', cap max_tokens to 4096 (Fusion router rejects > 4096).
  Panel via OPENROUTER_FUSION_PANEL (CSV) + OPENROUTER_FUSION_JUDGE, or preset
  via OPENROUTER_FUSION_PRESET (takes priority over panel).
- Add Config.LLM_BOOST_* (api_key/base_url/model_name, all None by default).
  No breaking change: when not set, _has_boost=False and chat_json raises
  ValueError if primary LLM fails (caller can wrap in try/except).
- Add 33 smoke tests under backend/tests/utils/test_llm_client.py covering
  fence stripping, think-tag stripping, truncation repair, boost fallback,
  Fusion routing. 28 pass, 5 xfail (document pr-600 gaps in repair_truncated_json
  phase 1/2 that need 'close final brace if depth_brace>0' upstream fix).
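
The fallback behavior described above (use the boost LLM when the primary fails, raise ValueError when no boost is configured) can be sketched like this. The function name chat_with_fallback and the callable-based interface are assumptions; the real LLMClient is configured through Config.LLM_BOOST_* rather than by passing callables.

```python
from typing import Callable, Optional

ChatFn = Callable[[str], str]


def chat_with_fallback(primary: ChatFn, boost: Optional[ChatFn], prompt: str) -> str:
    """Try the primary LLM first; fall back to the boost LLM only if one is set."""
    try:
        return primary(prompt)
    except Exception as primary_error:
        if boost is None:
            # No boost configured: surface the failure so the caller can handle it.
            raise ValueError(
                f"primary LLM failed and no boost is configured: {primary_error}"
            ) from primary_error
        return boost(prompt)
```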

Validated end-to-end with current .env:
- Fusion ping (gemma+1.2b free + llama-3.1-8b judge): 3.4s, returns valid JSON
- Structured JSON (Alice/30/Beijing): 2.2s, correct schema
- Graphiti-compatible entity/relationship JSON: 3.0s, correct schema
- DeepInfra (SIMULATION_LLM) chat: 0.3s, 'PONG' response
- Fusion max_tokens cap: 16000 -> 4096 (verified via mock)
- Non-Fusion models: extra_body NOT injected, max_tokens passed through
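
The Fusion-specific request shaping verified above might be sketched as below. The helper name build_request_kwargs is hypothetical, and the plugin entry carries only the id because the remaining plugin options are elided in the commit message.

```python
FUSION_MAX_TOKENS = 4096  # per the commit message, the Fusion router rejects larger values


def build_request_kwargs(model: str, max_tokens: int) -> dict:
    """Add the Fusion plugin and cap max_tokens only for models ending in '/fusion'."""
    kwargs = {"model": model, "max_tokens": max_tokens}
    if model.endswith("/fusion"):
        # Options beyond "id" (panel, judge, preset) are not shown in the source.
        kwargs["extra_body"] = {"plugins": [{"id": "fusion"}]}
        kwargs["max_tokens"] = min(max_tokens, FUSION_MAX_TOKENS)
    return kwargs
```

For any other model name the kwargs pass through untouched, so non-Fusion callers see no change.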
- Strip think tags in _clean_json_response (Fusion models emit them)
- Strip markdown code fences (moved from _chat_raw for defense in depth)
- Extract first balanced JSON object/array from prose-prefixed responses
  (Fusion deliberation models prepend reasoning text before JSON payload)
- Prioritize whichever of { or [ appears first in the string
- Update tests: rename test_does_not_strip_think_tags -> test_strips_think_tags
  and add 3 new prose-extraction tests
- 31 passed, 5 xfailed
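
The cleaning steps listed above can be approximated in a few lines. This sketch is not the _clean_json_response implementation: it swaps in the standard library's JSONDecoder.raw_decode to read the first JSON value instead of a hand-written balanced-bracket scan, and the literal <think> tag form is an assumption about what the models emit.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built at runtime to keep this example readable


def clean_json_response(text: str):
    """Strip think tags and fences, then parse the first JSON object or array."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = text.replace(FENCE + "json", "").replace(FENCE, "")
    # Start at whichever of { or [ appears first in the remaining text.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array found in response")
    # raw_decode parses one JSON value and ignores any trailing prose.
    obj, _end = json.JSONDecoder().raw_decode(text, min(starts))
    return obj
```

One limitation of this approach: a stray bracket in the prose before the payload makes raw_decode fail rather than skip ahead.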
Cherry-pick of upstream PRs 666ghj#318 + 666ghj#600 (LLMClient structured output
+ cascading fallback) with Fusion plugin support. Includes:
- f5608b5 wip: Fusion plugin support
- 23f49fa feat(llm): port PR 666ghj#318+666ghj#600 with Fusion patch
- 20077d0 feat: _clean_json_response handles Fusion prose-prefixed JSON

Working config validated: Fusion 2+1 (gemini-2.5-flash-lite + llama-3.1-8b
panel, gemini-2.5-flash-lite judge). Ontology gen OK. ~$0.002/call.
…ti-core 0.28.2)

Upstream graphiti-core 0.28.2 bulk_utils.add_nodes_and_edges_bulk_tx does
`entity_data.update(node.attributes or {})` for the Neo4j branch and the
resulting Cypher does `SET n = $entity_data`. If any value inside
`node.attributes` is a dict (nested), list-of-dicts, datetime, set, UUID,
or any other non-primitive, Neo4j rejects it with:
  Property values can only be of primitive types or arrays thereof.
  Encountered: Map{}

This commit adds a non-invasive flatten pass that runs *before*
`self._graphiti._process_episode_data`:
  - _flatten_for_neo4j(value): coerce dict -> json, list/tuple -> recurse,
    set/frozenset -> sorted list, datetime -> isoformat, fallback str()
  - _flatten_attributes(attrs): apply per key, log coerced keys at INFO
  - _apply_flatten_pass(nodes, edges): in-place assign on the Pydantic
    models, with object.__setattr__ fallback for frozen configs

The original (nested) shape is preserved in memory; only the field
assignment at the last moment is flattened. Reads via
`EntityNode.attributes` after the call see the flattened primitives.

Tests:
- backend/tests/test_attribute_flatten.py: 21 unit tests (primitives,
  nested dicts, datetimes, sets, UUID, Decimal, fallback path)
- backend/tests/test_e2e_flatten_fix.py: e2e against real Neo4j,
  reproduces the bug without the fix and validates save + round-trip with the fix applied.
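
The coercion rules listed for _flatten_for_neo4j can be illustrated with the standalone sketch below. It follows the rules as stated in the commit message (dict to JSON string, list/tuple recursed, set to sorted list, datetime to ISO format, str() fallback) but is not the committed code; sorting by string form and passing None through are assumptions.

```python
import json
from datetime import date, datetime

PRIMITIVES = (str, int, float, bool)


def flatten_for_neo4j(value):
    """Coerce a value into something Neo4j accepts as a property."""
    if value is None or isinstance(value, PRIMITIVES):
        return value
    if isinstance(value, dict):
        # Nested maps are rejected by Neo4j, so store them as a JSON string.
        return json.dumps(value, sort_keys=True, default=str)
    if isinstance(value, (list, tuple)):
        return [flatten_for_neo4j(v) for v in value]
    if isinstance(value, (set, frozenset)):
        # key=str keeps mixed-type sets sortable and the output deterministic.
        return sorted((flatten_for_neo4j(v) for v in value), key=str)
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return str(value)  # fallback for UUID, Decimal, and other objects


def flatten_attributes(attrs: dict) -> dict:
    """Apply the coercion to every attribute value, key by key."""
    return {key: flatten_for_neo4j(val) for key, val in attrs.items()}
```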
…rvability

Closes #8, #21. Adds:
- backend/app/services/model_router.py: per-agent/role routing with
  precedence by_agent_id > by_role > default, secrets via env only
- backend/app/services/llm_telemetry.py: per-call tokens/latency/cost/
  hashes/JSON-validity/round, wrapping the CAMEL backend
- backend/scripts/run_reddit_simulation.py: --model-map flag + audit
  + telemetry JSONL output
- scripts/export_telemetry.py: stdlib CSV+JSONL export
- configs/model_map_example.yaml, configs/model_prices.yaml
- docs/multimodel_agents.md, runs/smoke_multi_model artefacts
- tests covering routing precedence, secret hygiene, telemetry.
Closes #20. Adds a persistent local Markdown Wiki as auxiliary
audit/evidence context for ReportAgent. Does not replace Zep, GraphRAG,
or the existing operational memory stack. Baseline behavior unchanged
unless wiki_context is explicitly available.

- backend/app/services/wiki_memory/ (WikiStore, WikiCompiler, schemas)
- build_wiki_context_for_report() with non-fatal fallback to None
- Per-run/case artifacts: agents.md, index.md, timeline.md, sources.md,
  contradictions.md, entities/*.md, claims/*.md, wiki_meta.json
- ReportAgent prompt integration via <wiki_audit_context> tag
- docs/wiki_backed_report_memory.md
- scripts/real_lite_smoke.py
- Unit + integration + smoke tests

# Conflicts:
#	.gitignore
Adds:
- backend/app/graph/graphiti_backend.py: semantic dedup (cosine >= 0.85)
  pre-insert + isolated-node pruning (lurker prevention)
- backend/app/services/deep_search.py: Tavily-backed autonomous search
  with max_date injection to prevent data leakage during backtesting
- backend/requirements.txt: tavily-python
- Evaluation artefacts for IPC Argentina 2025 backtesting case
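
The pre-insert semantic dedup (cosine >= 0.85) can be sketched without any dependencies. The names cosine and find_duplicate are hypothetical, and the embeddings are plain lists of floats here; the real backend presumably gets them from an embedding model.

```python
import math
from typing import Dict, List, Optional

SIMILARITY_THRESHOLD = 0.85


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; zero vectors are treated as similar to nothing."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate(
    candidate: List[float],
    existing: Dict[str, List[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Optional[str]:
    """Return the most similar existing node at or above the threshold, else None."""
    best_name, best_score = None, threshold
    for name, embedding in existing.items():
        score = cosine(candidate, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

A caller would merge the candidate into the returned node when a name comes back, and insert a new node only when the result is None.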
- PR #27 starts from a pre-SIMULATION_LLM_ base; manually merged the
  S3 additions (PLANNING_CAPTURE_*, SIMILARITY_THRESHOLD, ENABLE_DEEP_SEARCH,
  DEEP_SEARCH_*, TAVILY_API_KEY, GEMINI_API_KEY) into the current
  backend/app/config.py, preserving the SIMULATION_LLM_* block from
  f20bfd9 and the FLASK_LLM_* / config additions from PRs #14 and #23.

# Conflicts:
#	.gitignore
#	backend/app/graph/graphiti_backend.py
@LucasErcolano LucasErcolano deleted the merge/main-feature-aggregation branch June 20, 2026 20:09
4 participants