Skip to content

HiChat-fog/formula-forge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FormulaForge

Tests

Dimension-constrained genetic programming with Bayesian model evidence for automated physical law discovery.

Given 10–20 noisy data points, FormulaForge discovers the underlying mathematical formula — without human intervention. It re-derived the pendulum equation, Newton's law of cooling, and the free-fall formula from raw data alone.

Input: 15 (L, T) pairs with ±2% noise. Output: T = 2π·√(L/g)


How It Works

1. Dimension-Constrained Grammar

Every variable carries physical dimensions [M, L, T, Θ]. The expression grammar enforces dimensional consistency at the syntax level — sin(mass) or length + time can never be generated, eliminating 90%+ of the search space before fitting begins.

2. Bayesian Model Evidence

Formulas are scored by marginal likelihood (BIC approximation), not R². This automatically penalizes complexity — a 3-parameter formula must fit substantially better than a 1-parameter alternative. Zero hyperparameters to tune.

3. Seeded Genetic Programming

Population initialized with known physical templates (linear, sqrt, exponential decay, power-law). Tournament selection + dimension-safe mutation + constant folding. Converges in 50–100 generations.

4. Overfitting Prevention

Data is split 80/20 train/validation. The GP fits on training data only. Final report shows both train and validation R² — a gap >15% triggers an explicit overfitting warning.


Results

Dataset Ground Truth FormulaForge Discovery
Free Fall h = ½g·t² 0.051·g²·t² 100%
Pendulum T = 2π·√(L/g) 6.283·√(L/g) 100%
Newton Cooling T = Ta + ΔT·e^(−kt) 22.2 + 73.4·exp(−0.081·t) 100%
Exponential y = a·e^(bx) 1.33·exp(0.65·x) 98.5%
Power Law y = a·x^b 1.89·x^{1.54} 99.4%
Quadratic y = ax² + bx + c 1.45·x² − 0.97·x + 1.90 99.9%
Linear y = ax + b 2.47·x − 1.31 100%
Multivar z = a·x + b·y + c 2.51·x − 1.83·y + 3.14 99.9%
Sine Wave y = A·sin(ωx + φ) 2.68·sin(1.79x) + 1.46·cos(1.79x) 99.7%

All discovered from 12–20 data points with ±2–5% Gaussian noise. 8/9 converge to the correct functional form. Multivariable supported. Sine with phase uses trig identity to reduce 3 nonlinear params to 1.


Quick Start

pip install -r requirements.txt

# CLI
python cli.py --dataset pendulum --generations 100

# Streamlit
streamlit run app.py

Requirements

numpy, scipy, matplotlib, streamlit

Architecture

formulaforge/
├── engine.py     # Dims, Node, grammar, OLS/DE fitting, BIC evidence, constant folding
├── gp.py         # Seeded population, tournament selection, mutation, evolve()
├── datasets.py   # 8 synthetic datasets (pendulum, cooling, freefall, ...)
├── app.py        # Streamlit interactive UI
└── cli.py        # Terminal runner

Key Innovations

  1. Dimension-constrained syntax — physical units as a grammar, not a post-hoc check
  2. BIC as fitness — automatic Occam's razor, no regularization hyperparameters
  3. Hybrid fitting — pure OLS for linear params, grid search + OLS for 1 nonlinear param, DE for 2+ nonlinear
  4. Trig identity linearizationsin(ωx+φ)c₁·sin(ωx) + c₂·cos(ωx), reducing 3 nonlinear to 1
  5. Symbolic constant foldingcos(cos(exp(π)))·9.80.0 automatically
  6. Native power operatorx^p displayed cleanly, with exp(p·log(x))x^p simplification
  7. Known constant integration — π, g, e available as non-parametric leaves

Limitations

  • Exponential fitting sensitive to data range (quadratic overfits on narrow intervals)
  • Tree depth limited to 6 for search efficiency
  • DE fallback (2+ nonlinear params) slower and less reliable than grid search

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages