Backtest slippage modeling + offline point-in-time LLM signal eval#2

Open
biswajeetdev wants to merge 1 commit into main from feat/performance-report

@biswajeetdev

Owner

What

Two improvements surfaced from a review of the trading bot against current open-source LLM-trading practice:

1. Slippage modeling in backtest.py

Backtests were commission-only (0.05%/side), so net fills were optimistic. Adds a configurable SLIPPAGE (0.05%/side, override via params["slippage"]) applied adversely at every fill point — BUY entries fill higher; SELL/STOP/TARGET/EOD exits fill lower. Commission + slippage are now printed in the run header.
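
A minimal sketch of the adverse-fill rule described above, not the code in backtest.py. The helper name `apply_slippage` and the `EXIT_KINDS` set are hypothetical; only `SLIPPAGE`, the 0.05%/side default, and the `params["slippage"]` override come from this PR.

```python
from typing import Optional

# Default per-side slippage as a fraction (0.05% = 0.0005), per the PR description.
SLIPPAGE = 0.0005

# Exit kinds named in the PR; the set itself is an illustrative assumption.
EXIT_KINDS = {"SELL", "STOP", "TARGET", "EOD"}


def apply_slippage(price: float, kind: str, params: Optional[dict] = None) -> float:
    """Return the fill price after adverse slippage (hypothetical helper)."""
    slip = (params or {}).get("slippage", SLIPPAGE)
    if kind == "BUY":
        return price * (1.0 + slip)  # entries fill higher than the quoted price
    if kind in EXIT_KINDS:
        return price * (1.0 - slip)  # exits fill lower than the quoted price
    raise ValueError(f"unknown fill kind: {kind!r}")
```

Under these assumptions a round trip costs roughly twice the per-side slippage on top of commission, which is why the effect should grow with trade count.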

2. eval_llm.py — offline point-in-time eval of the LLM debate brain (new)

The rule layer is validated by backtest.py/god_mode.py, but the LLM debate layer (signals/debate_brain.py) — the bot's claimed source of alpha — was previously scored only by live online loops (rag/strategy_evolver, self_improver). This harness gives it an offline test:

  • Replays historical as-of dates, rebuilds point-in-time price/technical indicators (rolling windows are backward-only), and calls the real debate_decide brain.
  • Scores each call against realized forward returns: per-action avg forward return, directional hit-rate, and BUY edge vs baseline.
  • Look-ahead-bias guard: --anonymize masks the ticker so the model can't recall a known stock's trajectory (cf. arXiv 2510.07920 Profit Mirage, 2601.13770 Look-Ahead-Bench); warns about pre-cutoff memorization.
  • Honest scoping: only reconstructable price/technical context is fed; live-only alt-data (social, Congress, options flow, VIX, fundamentals) is passed empty.
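
The three scores listed above could be computed along these lines. This is a sketch under assumptions, not the logic in eval_llm.py: the function name `score_calls`, the `HOLD` action label, and the definition of the baseline (the mean forward return over all evaluated as-of dates) are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean


def score_calls(calls):
    """Score (action, forward_return) pairs; hypothetical helper.

    Assumes the baseline is the mean forward return over every evaluated
    as-of date, regardless of the action the model chose.
    """
    by_action = defaultdict(list)
    for action, fwd in calls:
        by_action[action].append(fwd)

    # Per-action average forward return.
    per_action = {a: mean(rets) for a, rets in by_action.items()}

    # Directional hit rate: BUY is a hit if the forward return is positive,
    # SELL is a hit if it is negative. Other actions are excluded.
    directional = [(a, r) for a, r in calls if a in ("BUY", "SELL")]
    hits = sum(1 for a, r in directional
               if (a == "BUY" and r > 0) or (a == "SELL" and r < 0))
    hit_rate = hits / len(directional) if directional else None

    # BUY edge: average BUY forward return minus the assumed baseline.
    baseline = mean(r for _, r in calls) if calls else None
    buy_edge = per_action["BUY"] - baseline if "BUY" in per_action else None

    return {"per_action": per_action, "hit_rate": hit_rate,
            "baseline": baseline, "buy_edge": buy_edge}
```

A positive `buy_edge` would mean BUY calls outperformed simply holding over the same dates; with only a handful of as-of dates the number is noisy and should be read accordingly.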

Verification

  • Slippage: the drop in returns grows with trade count (NVDA −0.06% over 12 trades, MSFT −0.02% over 5 trades).
  • eval_llm.py: end-to-end smoke test (NVDA, 2 as-of dates) ran clean — point-in-time indicators → real brain → forward-return scoring → report.
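
To illustrate the point-in-time step in that pipeline, here is a hedged sketch of backward-only indicator reconstruction, forward-return scoring, and ticker masking. All three helper names and the `ASSET_A` alias are hypothetical; the real harness may structure this differently.

```python
def sma_as_of(closes, as_of_index, window):
    """Simple moving average using only bars at or before as_of_index."""
    if as_of_index + 1 < window:
        return None  # not enough history yet
    visible = closes[: as_of_index + 1]  # truncate: nothing after the as-of bar
    return sum(visible[-window:]) / window


def forward_return(closes, as_of_index, horizon):
    """Realized return over `horizon` bars; used for scoring only, never as model input."""
    j = as_of_index + horizon
    if j >= len(closes):
        return None  # horizon runs past the available data
    return closes[j] / closes[as_of_index] - 1.0


def anonymize(text, ticker, alias="ASSET_A"):
    """Mask the ticker in prompt text so the model cannot key on a known name."""
    return text.replace(ticker, alias)
```

A quick leak check under this design: changing any bar after the as-of index must leave `sma_as_of` unchanged, while `forward_return` is the only function allowed to read those later bars.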

Not included (follow-ups)

  • TradingAgents-style explicit risk-debate round.
  • Crisis-scenario stress sizing.

🤖 Generated with Claude Code

backtest.py: model adverse slippage (0.05%/side, override via params["slippage"])
at all fill points — buys fill higher, sells/stops/targets/EOD lower. Surface
commission+slippage in the run header. Backtests were commission-only and thus
optimistic on net fills.

eval_llm.py (new): offline harness that replays historical as-of dates, rebuilds
point-in-time price/technical indicators, calls the real debate_decide brain, and
scores each call against realized forward returns (per-action avg return,
directional hit-rate, BUY edge vs baseline). The LLM debate layer was previously
validated only by live online loops. Includes a look-ahead-bias guard
(--anonymize) and feeds only reconstructable context (live-only alt-data passed
empty).