__test#30
Closed
LucasErcolano wants to merge 28 commits into
Resolves issue #20.
- Add memory_mode feature flag: baseline | experimental, env/YAML driven, rollback-safe.
- Add experiment runner: deterministic run_id, seed control, snapshot config, seed/prompt hashes, results.json export, runs/<case>/<variant>/<seed>/ layout.
- Add docs and configs: docs/memory_experimental.md, docs/experiment_harness.md, configs/memory_baseline.yaml, configs/memory_experimental.yaml, configs/experiments/example_case.yaml, configs/experiments/v1_smoke_*.yaml incl. no-report smoke variant.
- Add tests: backend/tests/test_memory_mode.py, backend/tests/test_experiment_runner.py, backend/tests/test_experiment_runner_memory.py.
- Update backend services/tests for experimental memory integration, spike baseline/rollback behavior, memory metrics logging, and safe backend logger handling.
- Update .gitignore for logs/runs/artifacts.
- Final pre-merge cleanup: move temporary smoke/log artifacts out of tree; preserve no-report smoke config for simulation path validation.
Issue: #20
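As an illustration of the "deterministic run_id" and "seed/prompt hashes" items above, here is a minimal sketch. It is not the PR's experiment runner: the function names, the hash length, and the exact run_id format are assumptions; only the runs/<case>/<variant>/<seed>/ layout comes from the commit message.

```python
import hashlib
import json


def stable_hash(payload: str, length: int = 12) -> str:
    """Short SHA-256 prefix; the same input always gives the same digest."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def make_run_id(case: str, variant: str, seed: int, config: dict) -> str:
    """Hypothetical run_id: identifying fields plus a hash of the config snapshot.

    The config is serialized with sorted keys so key order cannot change the id.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return f"{case}-{variant}-{seed}-{stable_hash(canonical)}"


def run_dir(case: str, variant: str, seed: int) -> str:
    """Directory layout named in the commit message."""
    return f"runs/{case}/{variant}/{seed}"


if __name__ == "__main__":
    cfg = {"memory_mode": "experimental", "rounds": 3}
    print(make_run_id("example_case", "experimental", 42, cfg))
    print(run_dir("example_case", "experimental", 42))
    print(stable_hash("prompt text used for this run"))
```

Hashing a canonical serialization, rather than using a timestamp or random UUID, is what lets two runs with the same case, variant, seed, and config be recognized as the same experiment.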
Add optional wiki audit context layer to ReportAgent that compiles simulation knowledge-base pages into structured context injected into planning and section-generation prompts. Feature is fully opt-in via build_wiki_context_for_report()/wiki_context=None: no change to existing behavior when not activated.
Implementation:
- backend/app/services/wiki_memory/: new package (WikiStore, WikiCompiler, schemas, templates) for compiling wiki pages into context for report generation
- backend/app/services/report_agent.py: add wiki_context param, inject <wiki_audit_context> block into plan_outline and generate_section_react prompts with prior-knowledge labeling
- backend/app/api/report.py: integrate wiki context building with graceful degradation (non-fatal on error)
- backend/app/services/__init__.py: refactor to lazy-import heavy services, eager-export wiki_memory public API
Tests: 116/116 passing (compiler, store, integration, smoke).
Docs: docs/wiki_backed_report_memory.md with MVP activation details.
Smoke: scripts/real_lite_smoke.py for real-LLM verification.
Route OASIS simulation agents to different LLMs via a YAML model map and record per-call telemetry (tokens, latency, estimated cost) so every agent action is traceable to the model that produced it. Fully opt-in via --model-map; single-model behavior is unchanged without it.
- model_router.py: load/validate model map, resolve ModelPolicy per agent (precedence by_agent_id > by_role > default), lazy CAMEL backend build. Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun), not LLMClient, which is not in the agent LLM path; write one JSONL record per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends, redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/ recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets, cost, telemetry wrapper).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
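The routing precedence described above (by_agent_id > by_role > default, with literal api_key values rejected) can be sketched as follows. This is a hedged illustration, not the contents of model_router.py: the ModelPolicy fields `model` and `api_key_env`, and the helper names, are assumptions; only the three map sections and the precedence order come from the commit message.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ModelPolicy:
    model: str        # hypothetical field: model name to route to
    api_key_env: str  # hypothetical field: NAME of the env var holding the key


def validate_entry(entry: Mapping[str, Any]) -> ModelPolicy:
    """Reject literal secrets; only an env-var name may appear in the map."""
    if "api_key" in entry:
        raise ValueError("literal api_key is not allowed; use api_key_env")
    return ModelPolicy(model=entry["model"], api_key_env=entry["api_key_env"])


def resolve_policy(
    model_map: Mapping[str, Any], agent_id: int, role: Optional[str] = None
) -> ModelPolicy:
    """Apply the precedence by_agent_id > by_role > default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_agent:
        return validate_entry(by_agent[agent_id])
    if role is not None and role in by_role:
        return validate_entry(by_role[role])
    return validate_entry(model_map["default"])


if __name__ == "__main__":
    # In-memory stand-in for a parsed YAML model map (illustrative values).
    model_map = {
        "default": {"model": "model-a", "api_key_env": "LLM_API_KEY"},
        "by_role": {"moderator": {"model": "model-b", "api_key_env": "LLM_API_KEY"}},
        "by_agent_id": {3: {"model": "model-c", "api_key_env": "LLM_API_KEY"}},
    }
    print(resolve_policy(model_map, 3, "moderator").model)  # agent id wins
    print(resolve_policy(model_map, 7, "moderator").model)  # role wins
    print(resolve_policy(model_map, 7).model)               # default
```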
…LM telemetry (#21)
Supersedes the spike's inline agent_configs llm_* routing with the configurable agent_model_map.yaml router + per-call telemetry, as the spike itself called for. Spike evidence docs are preserved.
# Conflicts:
# backend/scripts/run_reddit_simulation.py
- run_parallel_simulation.py is single-model per platform (not wired): concurrent platforms make a shared sink.current_round racy; full wiring needs per-platform sinks/round contexts. - SDK-internal retries are below the instrumented run()/arun(): one telemetry row per top-level call (final usage or final error). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#21)
Closes the issue's 'Smoke run with 2 real models' checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint) alongside the original local-vLLM recipe
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
field coverage (#21)
Addresses PR #14 review by @LucasErcolano:
- Canonical config file section: agent_model_map.yaml (runtime) vs configs/model_map_example.yaml (template) vs smoke evidence maps.
- Smoke run section now states the real 2-model run was executed (Gemini, no GPU) and is the final S2 evidence; this fixes the stale "deferred" wording that contradicted README.md.
- Telemetry: explicit Issue #21 required-field coverage table, retries documented as stable (SDK-internal, not a separate field).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o and prepare branch
…e autonomous Deep Search
… + cascading fallback) with Fusion patch
- Replace backend/app/utils/llm_client.py with pr-600 version (15bd114) on top of pr-318 refactor (52c177f): cleaner facade, _chat_raw internal helper, _clean_json_response for markdown fence stripping, repair_truncated_json module-level helper, cascading fallback to LLM_BOOST_* when primary LLM fails.
- Re-adapt Fusion patch: inject extra_body={'plugins':[{id:'fusion',...}]} when model ends in '/fusion', cap max_tokens to 4096 (Fusion router rejects > 4096). Panel via OPENROUTER_FUSION_PANEL (CSV) + OPENROUTER_FUSION_JUDGE, or preset via OPENROUTER_FUSION_PRESET (takes priority over panel).
- Add Config.LLM_BOOST_* (api_key/base_url/model_name, all None by default). No breaking change: when not set, _has_boost=False and chat_json raises ValueError if primary LLM fails (caller can wrap in try/except).
- Add 33 smoke tests under backend/tests/utils/test_llm_client.py covering fence stripping, think-tag stripping, truncation repair, boost fallback, Fusion routing. 28 pass, 5 xfail (document pr-600 gaps in repair_truncated_json phase 1/2 that need 'close final brace if depth_brace>0' upstream fix).
Validated end-to-end with current .env:
- Fusion ping (gemma+1.2b free + llama-3.1-8b judge): 3.4s, returns valid JSON
- Structured JSON (Alice/30/Beijing): 2.2s, correct schema
- Graphiti-compatible entity/relationship JSON: 3.0s, correct schema
- DeepInfra (SIMULATION_LLM) chat: 0.3s, 'PONG' response
- Fusion max_tokens cap: 16000 -> 4096 (verified via mock)
- Non-Fusion models: extra_body NOT injected, max_tokens passed through
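The Fusion rules stated above (suffix detection, the 4096-token cap, preset taking priority over panel) can be sketched as below. This is an assumption-laden illustration rather than the patched llm_client.py: the helper names and the shape of the returned dict are hypothetical, and the fields inside the plugin payload are deliberately not reproduced because the commit message elides them.

```python
import os
from typing import Any, Dict, Mapping, Optional

FUSION_SUFFIX = "/fusion"
FUSION_MAX_TOKENS = 4096  # per the commit message, the Fusion router rejects > 4096


def is_fusion_model(model: str) -> bool:
    return model.endswith(FUSION_SUFFIX)


def cap_max_tokens(model: str, max_tokens: int) -> int:
    """Cap only for Fusion models; other models pass through unchanged."""
    return min(max_tokens, FUSION_MAX_TOKENS) if is_fusion_model(model) else max_tokens


def resolve_fusion_selection(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Pick preset or panel from the environment; preset wins when both are set.

    The returned dict is a hypothetical intermediate form, not the plugin payload.
    """
    env = os.environ if env is None else env
    preset = (env.get("OPENROUTER_FUSION_PRESET") or "").strip()
    if preset:
        return {"mode": "preset", "preset": preset}
    raw_panel = env.get("OPENROUTER_FUSION_PANEL") or ""
    panel = [name.strip() for name in raw_panel.split(",") if name.strip()]
    judge = (env.get("OPENROUTER_FUSION_JUDGE") or "").strip() or None
    return {"mode": "panel", "panel": panel, "judge": judge}


if __name__ == "__main__":
    print(cap_max_tokens("vendor/model/fusion", 16000))
    print(cap_max_tokens("vendor/model", 16000))
    print(resolve_fusion_selection({"OPENROUTER_FUSION_PANEL": "m1, m2",
                                    "OPENROUTER_FUSION_JUDGE": "m3"}))
```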
- Strip think tags in _clean_json_response (Fusion models emit them)
- Strip markdown code fences (moved from _chat_raw for defense in depth)
- Extract first balanced JSON object/array from prose-prefixed responses
(Fusion deliberation models prepend reasoning text before JSON payload)
- Prioritize whichever of { or [ appears first in the string
- Update tests: rename test_does_not_strip_think_tags -> test_strips_think_tags
and add 3 new prose-extraction tests
- 31 passed, 5 xfailed
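A minimal sketch of the cleaning steps listed above: strip think tags, strip markdown fences, then extract the first balanced JSON object or array from a prose-prefixed response. It is not the PR's _clean_json_response; the function names are hypothetical and the `<think>...</think>` tag form is an assumption.

```python
import json
import re
from typing import Optional

# Assumption: reasoning is wrapped as <think>...</think>.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Leading ```json / ``` fence and trailing ``` fence around the whole response.
_FENCE_RE = re.compile(r"^```(?:json)?\s*\n?|\n?```\s*$", re.IGNORECASE)


def extract_first_json(text: str) -> Optional[str]:
    """Return the first balanced {...} or [...] span, or None if there is none.

    Starts at whichever of '{' or '[' appears first and ignores brackets
    that occur inside JSON string literals.
    """
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        return None
    start = min(starts)
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # unbalanced, e.g. a truncated response


def clean_json_response(text: str) -> str:
    """Think tags, then fences, then prose prefix/suffix."""
    text = _THINK_RE.sub("", text).strip()
    text = _FENCE_RE.sub("", text).strip()
    return extract_first_json(text) or text


if __name__ == "__main__":
    raw = '<think>plan</think>Here is the result:\n```json\n{"name": "Alice", "tags": ["a"]}\n```'
    print(json.loads(clean_json_response(raw)))
```

Because the scan stops at the first balanced span, prose that itself contains brackets before the payload (for example a citation like `[1]`) would be returned instead of the intended JSON; a caller may want to validate the result with `json.loads` and retry on failure.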
…cktesting evaluations
Cherry-pick of upstream PRs 666ghj#318 + 666ghj#600 (LLMClient structured output + cascading fallback) with Fusion plugin support.
Includes:
- f5608b5 wip: Fusion plugin support
- 23f49fa feat(llm): port PR 666ghj#318+666ghj#600 with Fusion patch
- 20077d0 feat: _clean_json_response handles Fusion prose-prefixed JSON
Working config validated: Fusion 2+1 (gemini-2.5-flash-lite + llama-3.1-8b panel, gemini-2.5-flash-lite judge). Ontology gen OK. ~$0.002/call.
…ti-core 0.28.2)
Upstream graphiti-core 0.28.2 bulk_utils.add_nodes_and_edges_bulk_tx does
`entity_data.update(node.attributes or {})` for the Neo4j branch and the
resulting Cypher does `SET n = $entity_data`. If any value inside
`node.attributes` is a dict (nested), list-of-dicts, datetime, set, UUID,
or any other non-primitive, Neo4j rejects it with:
Property values can only be of primitive types or arrays thereof.
Encountered: Map{}
This commit adds a non-invasive flatten pass that runs *before*
`self._graphiti._process_episode_data`:
- _flatten_for_neo4j(value): coerce dict -> json, list/tuple -> recurse,
set/frozenset -> sorted list, datetime -> isoformat, fallback str()
- _flatten_attributes(attrs): apply per key, log coerced keys at INFO
- _apply_flatten_pass(nodes, edges): in-place assign on the Pydantic
models, with object.__setattr__ fallback for frozen configs
The original (nested) shape is preserved in memory; only the field
assignment at the last moment is flattened. Reads via
`EntityNode.attributes` after the call see the flattened primitives.
Tests:
- backend/tests/test_attribute_flatten.py: 21 unit tests (primitives,
nested dicts, datetimes, sets, UUID, Decimal, fallback path)
- backend/tests/test_e2e_flatten_fix.py: e2e against real Neo4j,
reproduces the bug without the fix and validates save + round-trip with the fix.
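The coercion rules described in this commit message (dict -> JSON string, list/tuple -> recurse, set/frozenset -> sorted list, datetime -> isoformat, fallback str()) can be sketched as below. This is an illustrative reimplementation under assumptions, not the PR's code: the function names drop the leading underscore, the return shape of flatten_attributes is hypothetical, and sorting sets by their string form is a choice made here so that mixed-type sets do not raise.

```python
import json
import logging
import uuid
from datetime import date, datetime
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)

_PRIMITIVES = (str, int, float, bool)


def flatten_for_neo4j(value: Any) -> Any:
    """Coerce a value into a primitive or a list of primitives."""
    if value is None or isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, dict):
        # Nested maps are not valid property values, so store them as JSON text.
        return json.dumps(value, sort_keys=True, default=str)
    if isinstance(value, (list, tuple)):
        return [flatten_for_neo4j(item) for item in value]
    if isinstance(value, (set, frozenset)):
        return sorted((flatten_for_neo4j(item) for item in value), key=str)
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return str(value)  # fallback for UUID, Decimal, and anything else


def flatten_attributes(attrs: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Flatten every value and report which keys were changed."""
    flat = {key: flatten_for_neo4j(val) for key, val in attrs.items()}
    coerced = [key for key in attrs if flat[key] != attrs[key]]
    if coerced:
        logger.info("flattened attribute keys: %s", coerced)
    return flat, coerced


if __name__ == "__main__":
    attrs = {
        "name": "node",
        "meta": {"source": "episode", "score": 0.9},
        "seen_at": datetime(2025, 1, 2, 3, 4, 5),
        "tags": {"b", "a"},
        "ref": uuid.UUID(int=1),
    }
    flat, coerced = flatten_attributes(attrs)
    print(flat)
    print(coerced)
```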
…rvability
Closes #8, #21.
Adds:
- backend/app/services/model_router.py: per-agent/role routing with precedence by_agent_id > by_role > default, secrets via env only
- backend/app/services/llm_telemetry.py: per-call tokens/latency/cost/hashes/JSON-validity/round, wrapping the CAMEL backend
- backend/scripts/run_reddit_simulation.py: --model-map flag + audit + telemetry JSONL output
- scripts/export_telemetry.py: stdlib CSV+JSONL export
- configs/model_map_example.yaml, configs/model_prices.yaml
- docs/multimodel_agents.md, runs/smoke_multi_model artefacts
- tests covering routing precedence, secret hygiene, telemetry.
Closes #20.
Adds a persistent local Markdown Wiki as auxiliary audit/evidence context for ReportAgent. Does not replace Zep, GraphRAG, or the existing operational memory stack. Baseline behavior unchanged unless wiki_context is explicitly available.
- backend/app/services/wiki_memory/ (WikiStore, WikiCompiler, schemas)
- build_wiki_context_for_report() with non-fatal fallback to None
- Per-run/case artifacts: agents.md, index.md, timeline.md, sources.md, contradictions.md, entities/*.md, claims/*.md, wiki_meta.json
- ReportAgent prompt integration via <wiki_audit_context> tag
- docs/wiki_backed_report_memory.md
- scripts/real_lite_smoke.py
- Unit + integration + smoke tests
# Conflicts:
# .gitignore
Adds:
- backend/app/graph/graphiti_backend.py: semantic dedup (cosine >= 0.85) pre-insert + isolated-node pruning (lurker prevention)
- backend/app/services/deep_search.py: Tavily-backed autonomous search with max_date injection to prevent data leakage during backtesting
- backend/requirements.txt: tavily-python
- Evaluation artefacts for IPC Argentina 2025 backtesting case
- PR #27 starts from a pre-SIMULATION_LLM_ base; manually merged the S3 additions (PLANNING_CAPTURE_*, SIMILARITY_THRESHOLD, ENABLE_DEEP_SEARCH, DEEP_SEARCH_*, TAVILY_API_KEY, GEMINI_API_KEY) into the current backend/app/config.py, preserving the SIMULATION_LLM_* block from f20bfd9 and the FLASK_LLM_* / config additions from PRs #14 and #23.
# Conflicts:
# .gitignore
# backend/app/graph/graphiti_backend.py
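The "semantic dedup (cosine >= 0.85) pre-insert" item can be illustrated with a small, dependency-free sketch. It assumes each node already has an embedding vector and that a candidate is treated as a duplicate of the most similar existing node at or above the threshold; the function names and the in-memory dict of embeddings are hypothetical, not the graphiti_backend.py implementation.

```python
import math
from typing import Dict, Optional, Sequence

SIMILARITY_THRESHOLD = 0.85  # threshold value taken from the commit message


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate(
    candidate: Sequence[float],
    existing: Dict[str, Sequence[float]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Optional[str]:
    """Return the id of the most similar existing node if it meets the threshold."""
    best_id: Optional[str] = None
    best_score = threshold
    for node_id, embedding in existing.items():
        score = cosine_similarity(candidate, embedding)
        if score >= best_score:
            best_id, best_score = node_id, score
    return best_id


if __name__ == "__main__":
    existing = {"inflation": [1.0, 0.0], "elections": [0.0, 1.0]}
    print(find_duplicate([0.9, 0.1], existing))  # close to "inflation"
    print(find_duplicate([1.0, 1.0], existing))  # below threshold for both
```

A caller would skip the insert (or merge into the returned node) when find_duplicate returns an id, and insert normally when it returns None.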
test