Given a controlled synthetic signal series, can a simple, transparent candidate signal (a moving-average crossover) be shown to add value over a buy-and-hold baseline — and, just as importantly, can the whole investigation be made reproducible, validated, bias-aware, robustness-tested, and decision-ready?
The emphasis is deliberately on the research workflow, not on "winning". The output is evidence and a clear interpretation, whatever the sign of the result. "Backtesting" here means a controlled research experiment, not trading advice or a deployment-ready system.
Data is generated, never downloaded. We use a geometric random walk:
value_t = value_{t-1} * exp(drift + volatility * z_t)
with z_t drawn from a seeded NumPy generator. A fixed seed means the
series — and therefore every chart, CSV, and KPI — is byte-stable across runs.
We call it a synthetic signal series (not an asset price) to keep the project
firmly in research territory and away from any market-prediction framing.
Synthetic data is the default. An optional CSV loader
(neuroquant.data.load_csv_series) lets an analyst point the same pipeline at
their own local historical data (a timestamp column plus close). It reads a
local file only — no network, no API — and applies the same data-quality gates
(sorted, de-duplicated timestamps; no missing or non-positive close).
Before any analysis runs, inputs must pass validate_price_frame:
- input is a non-empty
DataFrame, - the required
closecolumn is present, - no missing /
NaNvalues, - all values strictly positive (a level series cannot be ≤ 0),
- enough rows to support the requested rolling windows.
Configuration is guarded by validate_window_config: windows must be positive
integers and the short window must be strictly less than the long window,
otherwise a crossover is meaningless. Every failure raises a ValidationError
with an actionable message.
features.py builds small, causal features (no forward shifts): returns,
moving averages and price-to-MA distance, multi-horizon momentum, rolling
z-scores, rolling volatility, drawdown-from-rolling-high, and volatility regime
labels. Because no feature looks ahead, building the feature frame introduces no
leakage; positions derived from features are additionally lagged one bar.
The lab compares four long-or-flat candidate families (signals.py):
- trend — fast/slow moving-average crossover,
- momentum — positive trailing momentum,
- mean reversion — long when the z-score is oversold,
- composite — a transparent average of the three signals above, thresholded (a literal composite of simple signals, not a machine-learning model).
An optional volatility filter damps exposure when rolling volatility exceeds a trailing (expanding-quantile) threshold, so the filter itself never uses future data.
For each configuration we:
- build the family's target position from past-only features,
- shift the position by one bar so a decision uses only prior information — this removes look-ahead bias, the most common silent error in this kind of analysis,
- apply a simple transaction cost + slippage on every position change,
- compute strategy returns and a cumulative equity curve.
Candidate selection runs across all families on the in-sample period; the winner is then judged out-of-sample and in walk-forward folds.
Every strategy is measured against a buy-and-hold baseline over the same window. Without a benchmark, a return number is uninterpretable; with one, we can state plainly whether the rule added value.
We evaluate a grid of (short, long) window pairs and collect the KPIs into a tidy table sorted by Sharpe. This shows whether any apparent edge is robust across settings or just a lucky single cell — visualised as a Sharpe heatmap.
Choosing parameters and judging them on the same data is the most common way to fool yourself. We split the series chronologically into an in-sample (training) period and a later, non-overlapping out-of-sample (test) period. The configuration is selected on the in-sample sweep and then evaluated on the out-of-sample period it never saw. Both reads are reported side by side.
A single split is one trial. Walk-forward validation repeats the select-then-evaluate cycle over rolling windows: pick the best configuration on a training window, evaluate it on the next window, then step forward. The result is a per-fold table comparing in-sample and out-of-sample Sharpe — a direct read on whether an in-sample choice tends to survive on unseen data.
To gauge how fragile an outcome is, we bootstrap the realised per-bar return stream (resampling with replacement) into many alternative equity paths and summarise the spread: median, 5th and 95th percentile total return, the probability of a losing path, and the drawdown distribution. This is a descriptive robustness diagnostic, not a forecast.
Beyond a single headline number we attribute and stress the selected result:
- Regime attribution (
regime.py) breaks performance down by volatility regime (low / normal / high / stress) so we can see whether the result came from calm or turbulent periods. - Cost sensitivity (
stress.py) re-runs the selected rule across a ladder of transaction-cost assumptions; by construction higher costs never improve a net result. - Stress diagnostics (
stress.py) apply clearly-labelled synthetic transforms — higher costs, a volatility shock, an adverse drift, and amplified downside — to probe fragility. These are diagnostics, not predictions.
| Metric | Why it is included |
|---|---|
total_return / baseline_return |
Headline outcome vs the benchmark |
excess_return |
Strategy return net of the baseline |
annualized_return |
Compounded growth on a one-year scale |
annualized_volatility |
Standardised risk scale |
sharpe_ratio |
Reward per unit of total risk |
sortino_ratio |
Reward per unit of downside risk |
calmar_ratio |
Return per unit of worst-case drawdown |
information_ratio |
Active return per unit of tracking error vs baseline |
max_drawdown |
Worst-case peak-to-trough decline |
active_days |
How often the signal is exposed |
trade_count / turnover |
Activity (cost drivers) |
correlation_to_baseline |
How differentiated the signal really is |
win_loss_ratio |
Shape of the return distribution |
- Synthetic data has no real-world structure; results do not generalise to any market and are not meant to.
- A single random seed is one realisation; the walk-forward and Monte Carlo steps probe stability, but broad conclusions would still need many seeds.
- The cost / slippage model is intentionally simple (a flat per-change fraction).
- The signal family is deliberately minimal to keep focus on the workflow.
Using only seeded synthetic data keeps the project reproducible, removes any dependence on third-party APIs or keys, and makes clear that the goal is a demonstration of analytics craft — not forecasting markets or giving advice.
- Are inputs validated before they are trusted?
- Is the experiment fair (benchmark, no look-ahead, selection separated from evaluation)?
- Are the metrics sensible, documented, and honestly reported?
- Is the result tested out-of-sample and for robustness?
- Is everything reproducible from a seed?
- Is the conclusion communicated clearly to a non-technical audience?
This project applies widely taught analytics best practices rather than any proprietary method:
- Input validation / data-quality gates before analysis.
- Reproducibility via fixed random seeds and committed outputs.
- Benchmark / baseline comparison so results are interpretable.
- Avoiding look-ahead bias by lagging signals relative to information.
- Out-of-sample evaluation and walk-forward validation to separate parameter selection from judgement.
- Bootstrap / Monte Carlo robustness analysis to gauge outcome stability.
- Scenario / sensitivity analysis across a parameter grid.
- Clear visual and written reporting for decision-makers.
(No external sources were browsed and no citations are fabricated; the above are standard, well-established analytics concepts.)