Backtest slippage modeling + offline point-in-time LLM signal eval by biswajeetdev · Pull Request #2 · biswajeetdev/ai-trader

biswajeetdev · 2026-06-16T14:50:49Z

What

Two improvements surfaced from a review of the trading bot against current open-source LLM-trading practice:

1. Slippage modeling in `backtest.py`

Backtests previously modeled commission only (0.05%/side), so net fills were optimistic. This PR adds a configurable `SLIPPAGE` (default 0.05%/side, overridable via `params["slippage"]`) that is applied adversely at every fill point: BUY entries fill higher; SELL/STOP/TARGET/EOD exits fill lower. Commission and slippage are now both printed in the run header.
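As a rough illustration of what "applied adversely" means, here is a minimal sketch. The helper name `fill_price`, its signature, and the side labels as string constants are assumptions made for this example; they are not copied from `backtest.py`, which may structure this differently.

```python
from typing import Optional

# Hypothetical constants and helper, for illustration only.
SLIPPAGE = 0.0005  # 0.05% per side, the default stated in this PR

BUY_SIDES = {"BUY"}
SELL_SIDES = {"SELL", "STOP", "TARGET", "EOD"}


def fill_price(quote: float, side: str, params: Optional[dict] = None) -> float:
    """Move the quoted price against the trader by the slippage fraction."""
    slip = (params or {}).get("slippage", SLIPPAGE)
    if side in BUY_SIDES:
        return quote * (1.0 + slip)  # entries fill higher
    if side in SELL_SIDES:
        return quote * (1.0 - slip)  # exits fill lower
    raise ValueError(f"unknown side: {side!r}")


entry = fill_price(100.0, "BUY")    # about 100.05
exit_ = fill_price(100.0, "SELL")   # about 99.95
# Slippage alone costs roughly 0.10% per round trip at the default setting,
# on top of the 0.05%/side commission.
round_trip_slippage = (entry - exit_) / 100.0
```

Under these assumptions a flat quote of 100.0 bought and sold loses about 0.10% to slippage, which is why adding slippage can only lower (never raise) a long-only backtest's net return.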

2. `eval_llm.py` — offline point-in-time eval of the LLM debate brain (new)

The rule layer is validated by `backtest.py`/`god_mode.py`, but the LLM debate layer (`signals/debate_brain.py`), which is the bot's claimed source of alpha, was previously scored only by live online loops (`rag/strategy_evolver`, `self_improver`). This harness gives it an offline test:

Replays historical as-of dates, rebuilds point-in-time price/technical indicators (rolling windows are backward-only), and calls the real debate_decide brain.
Scores each call against realized forward returns: per-action avg forward return, directional hit-rate, and BUY edge vs baseline.
Look-ahead-bias guard: `--anonymize` masks the ticker so the model can't recall a known stock's trajectory (cf. arXiv 2510.07920 Profit Mirage, 2601.13770 Look-Ahead-Bench); it also warns about pre-cutoff memorization.
Honest scoping: only reconstructable price/technical context is fed; live-only alt-data (social, Congress, options flow, VIX, fundamentals) is passed empty.
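The three scores listed above (per-action average forward return, directional hit-rate, BUY edge vs baseline) can be sketched as follows. This is not the code in `eval_llm.py`: the function name `score_calls`, the `HOLD` label, and the choice of baseline (the mean forward return over every evaluated as-of date) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def score_calls(calls):
    """Score a list of (action, forward_return) pairs, e.g. ("BUY", 0.012)."""
    by_action = defaultdict(list)
    for action, fwd in calls:
        by_action[action].append(fwd)

    # Assumed baseline: average forward return across all as-of dates,
    # regardless of what the model said.
    baseline = mean(fwd for _, fwd in calls)

    report = {
        action: {"n": len(rets), "avg_fwd_return": mean(rets)}
        for action, rets in by_action.items()
    }

    # A directional call is a hit when the sign of the realized return
    # matches the call: BUY and up, or SELL and down.
    directional = [(a, r) for a, r in calls if a in ("BUY", "SELL")]
    hits = sum(
        1 for a, r in directional
        if (a == "BUY" and r > 0) or (a == "SELL" and r < 0)
    )
    report["hit_rate"] = hits / len(directional) if directional else None

    buys = by_action.get("BUY")
    report["buy_edge"] = mean(buys) - baseline if buys else None
    return report


calls = [("BUY", 0.02), ("BUY", -0.01), ("SELL", -0.03), ("HOLD", 0.01)]
report = score_calls(calls)
```

With this toy input, two of the three directional calls are hits, and the BUY edge is the BUY average (0.5%) minus the baseline (−0.25%). A positive BUY edge only suggests the model's BUY calls beat the unconditional average on this sample; it says nothing about statistical significance.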

Verification

Slippage: the return drag grows with trade count (NVDA −0.06% over 12 trades, MSFT −0.02% over 5 trades).
`eval_llm.py`: an end-to-end smoke test (NVDA, 2 as-of dates) ran clean through the full pipeline: point-in-time indicators → real brain → forward-return scoring → report.
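To make the point-in-time constraint in that pipeline concrete, here is a small sketch using plain Python lists. All names (`point_in_time_context`, `forward_return`, the `TICKER_A` placeholder) and the use of a simple moving average are assumptions for illustration; the real harness builds its own indicators and calls `debate_decide`, which is not reproduced here.

```python
def sma(values, window):
    """Simple moving average of the last `window` values, or None if too few."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window


def point_in_time_context(ticker, closes, i, window=3, anonymize=False):
    """Build model context from closes[0..i] only; later bars are never read."""
    history = closes[: i + 1]  # backward-only slice
    return {
        # Placeholder label is hypothetical; it stands in for a masked ticker.
        "ticker": "TICKER_A" if anonymize else ticker,
        "close": history[-1],
        "sma": sma(history, window),
    }


def forward_return(closes, i, horizon):
    """Realized return from bar i to bar i + horizon; used only for scoring."""
    if i + horizon >= len(closes):
        return None
    return closes[i + horizon] / closes[i] - 1.0


closes = [100.0, 102.0, 101.0, 104.0, 106.0]
ctx = point_in_time_context("NVDA", closes, i=2, anonymize=True)
fwd = forward_return(closes, i=2, horizon=2)

# Sanity check against look-ahead: changing future bars must not change the context.
tampered = closes[:3] + [1.0, 1.0]
same_ctx = point_in_time_context("NVDA", tampered, i=2, anonymize=True)
```

The design point is the separation of the two functions: the context builder sees only data up to the as-of bar, while the forward return is computed separately and never passed to the model.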

Not included (follow-ups)

TradingAgents-style explicit risk-debate round.
Crisis-scenario stress sizing.

🤖 Generated with Claude Code

`backtest.py`: model adverse slippage (0.05%/side, override via `params["slippage"]`) at all fill points: buys fill higher, sells/stops/targets/EOD lower. Surface commission+slippage in the run header. Backtests were commission-only and thus optimistic on net fills.

`eval_llm.py` (new): offline harness that replays historical as-of dates, rebuilds point-in-time price/technical indicators, calls the real `debate_decide` brain, and scores each call against realized forward returns (per-action avg return, directional hit-rate, BUY edge vs baseline). The LLM debate layer was previously validated only by live online loops. Includes a look-ahead-bias guard (`--anonymize`) and feeds only reconstructable context (live-only alt-data passed empty).

biswajeetdev wants to merge 1 commit into main from feat/performance-report
