Dimension-constrained genetic programming with Bayesian model evidence for automated physical law discovery.
Given 10–20 noisy data points, FormulaForge discovers the underlying mathematical formula — without human intervention. It re-derived the pendulum equation, Newton's law of cooling, and the free-fall formula from raw data alone.
Input: 15 (L, T) pairs with ±2% noise. Output: T = 2π·√(L/g)
Every variable carries physical dimensions [M, L, T, Θ]. The expression grammar enforces dimensional consistency at the syntax level — sin(mass) or length + time can never be generated, eliminating 90%+ of the search space before fitting begins.
Formulas are scored by marginal likelihood (BIC approximation), not R². This automatically penalizes complexity — a 3-parameter formula must fit substantially better than a 1-parameter alternative. Zero hyperparameters to tune.
Population initialized with known physical templates (linear, sqrt, exponential decay, power-law). Tournament selection + dimension-safe mutation + constant folding. Converges in 50–100 generations.
Data is split 80/20 train/validation. The GP fits on training data only. Final report shows both train and validation R² — a gap >15% triggers an explicit overfitting warning.
| Dataset | Ground Truth | FormulaForge Discovery | R² |
|---|---|---|---|
| Free Fall | h = ½g·t² | 0.051·g²·t² |
100% |
| Pendulum | T = 2π·√(L/g) | 6.283·√(L/g) |
100% |
| Newton Cooling | T = Ta + ΔT·e^(−kt) | 22.2 + 73.4·exp(−0.081·t) |
100% |
| Exponential | y = a·e^(bx) | 1.33·exp(0.65·x) |
98.5% |
| Power Law | y = a·x^b | 1.89·x^{1.54} |
99.4% |
| Quadratic | y = ax² + bx + c | 1.45·x² − 0.97·x + 1.90 |
99.9% |
| Linear | y = ax + b | 2.47·x − 1.31 |
100% |
| Multivar | z = a·x + b·y + c | 2.51·x − 1.83·y + 3.14 |
99.9% |
| Sine Wave | y = A·sin(ωx + φ) | 2.68·sin(1.79x) + 1.46·cos(1.79x) |
99.7% |
All discovered from 12–20 data points with ±2–5% Gaussian noise. 8/9 converge to the correct functional form. Multivariable supported. Sine with phase uses trig identity to reduce 3 nonlinear params to 1.
pip install -r requirements.txt
# CLI
python cli.py --dataset pendulum --generations 100
# Streamlit
streamlit run app.pynumpy, scipy, matplotlib, streamlit
formulaforge/
├── engine.py # Dims, Node, grammar, OLS/DE fitting, BIC evidence, constant folding
├── gp.py # Seeded population, tournament selection, mutation, evolve()
├── datasets.py # 8 synthetic datasets (pendulum, cooling, freefall, ...)
├── app.py # Streamlit interactive UI
└── cli.py # Terminal runner
- Dimension-constrained syntax — physical units as a grammar, not a post-hoc check
- BIC as fitness — automatic Occam's razor, no regularization hyperparameters
- Hybrid fitting — pure OLS for linear params, grid search + OLS for 1 nonlinear param, DE for 2+ nonlinear
- Trig identity linearization —
sin(ωx+φ)→c₁·sin(ωx) + c₂·cos(ωx), reducing 3 nonlinear to 1 - Symbolic constant folding —
cos(cos(exp(π)))·9.8→0.0automatically - Native power operator —
x^pdisplayed cleanly, withexp(p·log(x))→x^psimplification - Known constant integration — π, g, e available as non-parametric leaves
- Exponential fitting sensitive to data range (quadratic overfits on narrow intervals)
- Tree depth limited to 6 for search efficiency
- DE fallback (2+ nonlinear params) slower and less reliable than grid search
MIT